Abstract: In this paper, we consider the distributive queue-aware
power and subband allocation design for a delay-optimal
OFDMA uplink system with one base station, $K$ users and
$N_F$ independent subbands. Each mobile has an uplink queue
with heterogeneous packet arrivals and delay requirements. We 
model the problem as an infinite horizon average reward Markov 
Decision Problem (MDP) where the control actions are functions 
of the instantaneous Channel State Information (CSI) as well 
as the joint Queue State Information (QSI). To address the 
distributive requirement and the issue of exponential memory 
requirement and computational complexity, we approximate the 
subband allocation Q-factor by the sum of the per-user subband 
allocation Q-factor and derive a distributive online stochastic 
learning algorithm to estimate the per-user Q-factor and the 
Lagrange multipliers (LM) simultaneously and determine the 
control actions using an auction mechanism. We show that under 
the proposed auction mechanism, the distributive online learning 
converges almost surely (with probability 1). For illustration, we 
apply the proposed distributive stochastic learning framework to 
an application example with exponential packet size distribution. 
We show that the delay-optimal power control has the multi-level
water-filling structure where the CSI determines the instantaneous
power allocation and the QSI determines the water-level. The proposed
algorithm has linear signaling overhead and computational complexity
$O(KN)$, which is desirable from an implementation perspective.



I. Introduction 

There is a large body of literature on cross-layer optimization
of power and subband allocation in OFDMA systems [1], [2].
Yet, all these works focus on optimizing the physical layer
performance, and the power/subband allocation solutions derived
are functions of the channel state information (CSI) only. On the
other hand, real-life applications are delay-sensitive, and it is
critical to consider the bursty arrivals and delay performance in
addition to the conventional physical layer performance (such as
sum-rate or proportional fairness) in OFDMA cross-layer design.
A combined framework taking into account both queueing delay and
physical layer performance is not trivial, as it involves both
queueing theory (to model the queue dynamics) and information
theory (to model the physical layer dynamics). One approach
converts the delay constraint into an average rate constraint using
the tail probability in the large delay regime and solves the
optimization problem using an information theoretical formulation
based on the rate constraint [3], [4]. While this approach allows
potentially simple solutions, the derived control policy is a
function of the CSI only, which is good only in the large delay
regime where the probability of an empty buffer is small. In
general, delay-optimal control actions should be a function of both
the CSI and the queue state information (QSI). In [5], the authors
showed that the Longest Queue Highest Possible Rate (LQHPR) policy
is delay-optimal for multiaccess fading channels. However, the
solution utilizes stochastic majorization theory, which requires
symmetry among the users and is difficult to extend to other
situations. In [6], [7], the authors studied the queue stability
region of various wireless systems using Lyapunov drift, which
is good only in the large delay regime. While all the above works
addressed different aspects of the delay-sensitive resource
allocation problem, there are still a number of first order issues
to be addressed to obtain decentralized resource optimization
for a delay-optimal uplink OFDMA system.

• The Curse of Dimensionality: A more general approach
is to model the problem as a Markov Decision Problem
(MDP) [8], [9]. However, a primary difficulty in determining
the optimal policy using the MDP approach is the
huge state space involved¹. For instance, the state space
is exponentially large in the number of users. Consider a
simple example. For a system with 4 users, 6 independent
subbands, a buffer size of 50 per user and 4 channel
states, the system state space contains
$4^{4 \times 6} \times (50 + 1)^4$ states, which is already
unmanageable.

• Decentralized Solution: Most of the solutions in the
literature are centralized, in which the processing is done
at the base station [10], requiring global knowledge of CSI
and QSI from the $K$ users. However, in the uplink direction,
the QSI is only available locally at each of the $K$ users.
Hence, a centralized solution at the BS requires all the $K$
users to deliver their QSI to the BS, which incurs
enormous signaling overhead, and the BS to broadcast
the allocation results for the resource allocations at the
mobile side in the uplink system. In addition, the centralized
solution also leads to an exponential computational
complexity at the BS.

• Convergence of Stochastic Iterative Solution: There are
a number of works on decentralized OFDMA control using
deterministic games [11] or primal-dual decomposition
theory for solving deterministic NUM [12]. The derived
distributive algorithms are iterative in nature, where all
the nodes exchange some messages explicitly in solving
the master problem. However, the CSI is always assumed
to be quasi-static during the iterative updates with message
passing. When we consider delay optimization, the
problem is stochastic in nature and is quite challenging
because the game is played repeatedly and the actions as
well as the payoffs are defined over ergodic realizations of
the system states (CSI, QSI). During the iterative updates,
the system state will not be quasi-static anymore.

¹ As illustrated later, we could derive the closed form action given the
Q-factor. Therefore, the curse of dimensionality refers to the exponential
growth of the state space only. The dimensionality of the action space is not
an issue.

In this paper, we consider an OFDMA uplink system
with one base station (BS), $K$ users and $N_F$ independent
subbands. The delay-optimal problem is cast into an infinite
horizon average reward constrained Markov Decision Process
(MDP). To address the distributive requirement and the issue
of exponential memory requirement and computational complexity,
we propose a distributive online stochastic learning
algorithm, which only requires knowledge of the local QSI
and the local CSI at each of the $K$ mobiles and determines
the resource control actions using a per-stage auction. Using
separation of time scales, we show that under the proposed
auction mechanism, the distributive online learning converges
almost surely. For illustration, we apply the proposed distributive
stochastic learning framework to an application example
with exponential packet size distribution. We show that the
delay-optimal power control has the multi-level water-filling
structure where the CSI determines the instantaneous power
allocation and the QSI determines the water-level. We show
that the proposed algorithm converges to the global optimal
solution for a sufficiently large number of users. The proposed
algorithm has linear signaling overhead and computational
complexity $O(KN)$, which is desirable from an implementation
perspective.




Fig. 1. OFDMA physical layer and queueing model. 



II. System Models 

In this section, we elaborate on the system model, the
OFDMA physical layer model and the underlying
queueing model. There is one BS and $K$ mobile users (each
with one uplink queue) in the OFDMA uplink system with $L$
subcarriers over a frequency selective fading channel with $N_F$
independent multipaths, as illustrated in Figure 1. The BS has
a cross-layer controller which takes the joint channel state
information (CSI) and joint queue state information (QSI)
as the inputs and produces power allocation and subband
allocation actions as output².

We first list the important notations in this paper in Table I.

Symbol                                        Description
--------------------------------------------  ------------------------------------------------
$K$                                           number of users
$N_F$                                         number of independent subbands
$N_Q$                                         buffer size
$k, n$                                        user, subband index
$\overline{N}_k$                              mean packet size of user $k$
$t$                                           slot index
$s_{k,n}, p_{k,n}$                            subband, power allocation action
$\Omega = (\Omega_p, \Omega_s)$               power and subband allocation policy
$\mathbf{H} = \{|H_{k,n}|\}$                  joint CSI
$\mathbf{Q} = (Q_k)$                          joint QSI
$\mathbf{A} = (A_k)$                          bit/packet arrival vector
$\chi = (\mathbf{H}, \mathbf{Q})$             global system state
$\tau$                                        frame duration
$\lambda_k$                                   average arrival rate of user $k$
$\mu_k(\chi)$                                 conditional mean departure rate of user $k$
$P_k, P_k^d$                                  total power, packet drop rate of user $k$
$\{V(\chi)\}$                                 system potential function on $\chi$
$\{\mathbb{Q}(\chi, \mathbf{s})\}$            subband allocation Q-factor
$\{\mathbb{Q}^k(\chi_k, \mathbf{s}_k)\}$      per-user subband allocation Q-factor
$\{q^k(Q, |H|, s)\}$                          per-user per-subband subband allocation Q-factor
$\overline{\gamma}^k$                         LM w.r.t. average power constraint of $k$
$\underline{\gamma}^k$                        LM w.r.t. average packet drop constraint of $k$
$\{\epsilon_t^q\}$                            stepsize sequence for per-user potential update
$\{\epsilon_t^\gamma\}$                       stepsize sequence for per-user 2 LMs update

TABLE I
List of Important Notations.



A. OFDMA Physical Layer Model 

Let $s_{k,n} \in \{0, 1\}$ denote the subband allocation for the
$k$-th user at the $n$-th subband. The received signal from the
$k$-th user at the $n$-th subband of the base station is given by
$Y_{k,n} = s_{k,n}(H_{k,n} X_{k,n} + Z_{k,n})$, where $X_{k,n}$ is the
transmitted symbol, and $H_{k,n}$ and $Z_{k,n}$ ($\sim \mathcal{CN}(0, 1)$)
are the random fading and the channel noise of the $k$-th user at the
$n$-th subband, respectively. The data rate of user $k$ can be expressed as:

$$R_k = \sum_{n=1}^{N_F} R_{k,n} = \sum_{n=1}^{N_F} s_{k,n} \log\left(1 + \xi p_{k,n} |H_{k,n}|^2\right) \tag{1}$$

for some constant $\xi$. Note that the data rate expression in (1)
can be used to model both uncoded and coded systems.
For an uncoded system using an MQAM constellation, the BER of
the $n$-th subband and the $k$-th user is given by [11]
$\mathrm{BER}_{k,n} \approx c_1 \exp\left(-c_2 \frac{\Gamma_{k,n}}{2^{R_{k,n}} - 1}\right)$,
where $\Gamma_{k,n}$ is the received SNR of the
$k$-th user at the $n$-th subband, and hence, for a target BER $\epsilon$,
we have $\xi = -\frac{c_2}{\ln(\epsilon / c_1)}$. On the other hand, for a
system with powerful error correction codes such as LDPC with reasonably
large block length (e.g. 8 Kbyte) and a target PER of 0.1%, the
maximum achievable data rate is given by the instantaneous
mutual information (to within 0.5 dB SNR). In that case, $\xi = 1$³.

² We first formulate the problem in a centralized manner and address the
distributive solution in Section IV.
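
As a quick illustration of the rate expression in (1), the following minimal sketch evaluates the data rate of a single user. The function name `user_rate`, the sample allocation, and the use of the natural logarithm (so the result is in nats per channel use) are assumptions made for illustration only; the text does not fix the base of the logarithm.

```python
import math


def user_rate(s, p, h, xi=1.0):
    """Evaluate R_k = sum_n s_n * log(1 + xi * p_n * |h_n|^2) for one user.

    s  : list of 0/1 subband allocation indicators (one per subband)
    p  : list of transmit powers on each subband
    h  : list of channel gains (real or complex) on each subband
    xi : SNR gap constant; xi = 1 corresponds to the mutual-information case
    """
    return sum(
        s_n * math.log(1.0 + xi * p_n * abs(h_n) ** 2)
        for s_n, p_n, h_n in zip(s, p, h)
    )


# Hypothetical example: the user holds subbands 0 and 2 out of three.
rate = user_rate(s=[1, 0, 1], p=[2.0, 5.0, 0.5], h=[1.0, 3.0, 2.0])
print(rate)
```

Only subbands with $s_{k,n} = 1$ contribute, so power placed on an unassigned subband (the second entry above) has no effect on the rate.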



B. Source Model, Queue Dynamics and Control Policy 

In this paper, the time dimension is partitioned into scheduling
slots indexed by $t$ with slot duration $\tau$.

Assumption 1: The joint CSI of the system is denoted by
$\mathbf{H}(t) = \{|H_{k,n}(t)| : \forall k, n\}$, where $|H_{k,n}(t)|$ is a
discrete r.v. distributed according to $\Pr[|H|]$. The CSI is quasi-static
within a scheduling slot and i.i.d. between scheduling slots⁴. ■

Let $\mathbf{A}(t) = (A_1(t), \dots, A_K(t))$ be the random new arrivals
(number of bits) at the end of the $t$-th scheduling slot.

Assumption 2: The arrival process $A_k(t)$ is i.i.d. over
scheduling slots according to a general distribution $\Pr(A_k)$
with average arrival rate $E[A_k] = \lambda_k$. ■

Let $\mathbf{Q}(t) = (Q_1(t), \dots, Q_K(t))$ be the joint QSI of the
$K$-user OFDMA system, where $Q_k(t)$ denotes the number of
bits in the $k$-th queue at the beginning of the $t$-th slot. $N_Q$
denotes the maximum buffer size (number of bits). Thus, the
cardinality of the joint QSI is $I_Q = (N_Q + 1)^K$, which grows
exponentially with $K$. Let $N_H$ denote the cardinality of $|H_{k,n}|$
($\forall k, n$). Hence, the cardinality of the global CSI is given by
$I_H = N_H^{K N_F}$. Let $\mathbf{R}(t) = (R_1(t), \dots, R_K(t))$ (bits/second)
be the scheduled data rates of the $K$ users, where $R_k(t)$ is
given by (1). We assume the controller is causal so that the new
bit arrivals $\mathbf{A}(t)$ are observed only after the controller's actions
at the $t$-th slot. Hence, the queue dynamics are given by the
following equation:

$$Q_k(t+1) = \min\left\{ [Q_k(t) - R_k(t)\tau]^+ + A_k(t),\ N_Q \right\}, \quad \forall k \in \{1, \dots, K\} \tag{2}$$

where $x^+ = \max\{x, 0\}$ and $\tau$ is the duration of a scheduling
slot.
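
The queue recursion in (2) can be sketched in a few lines. The function name `queue_step` and the numerical values below are hypothetical and chosen only to exercise the three regimes of the update: normal service, an emptied queue, and buffer overflow.

```python
def queue_step(q, rate, tau, arrivals, n_q):
    """One slot of the queue dynamics in (2).

    q        : queue length (bits) at the beginning of the slot
    rate     : scheduled data rate R_k(t) (bits/second)
    tau      : slot duration (seconds)
    arrivals : new bits arriving at the end of the slot
    n_q      : maximum buffer size (bits)
    """
    remaining = max(q - rate * tau, 0.0)   # [Q_k(t) - R_k(t) * tau]^+
    return min(remaining + arrivals, n_q)  # arrivals beyond n_q are dropped


# Hypothetical values for illustration.
print(queue_step(q=10, rate=4, tau=1, arrivals=3, n_q=50))  # normal service
print(queue_step(q=2, rate=4, tau=1, arrivals=3, n_q=50))   # queue emptied first
print(queue_step(q=49, rate=0, tau=1, arrivals=5, n_q=50))  # buffer overflow
```

Because arrivals are added after the departures, the sketch reflects the causality assumption that the controller acts before observing $\mathbf{A}(t)$.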

For notational convenience, we denote $\chi(t) = (\mathbf{H}(t), \mathbf{Q}(t))$
to be the global system state at the $t$-th slot. Therefore, the
cardinality of the state space of $\chi$ is
$I_\chi = I_H \times I_Q = \left(N_H^{N_F}(N_Q + 1)\right)^K$. Given the
observed system state realization $\chi(t)$ at the beginning of the $t$-th
slot, the transmitter may adjust the transmit power and subband allocation
(equivalently the data rate $\mathbf{R}(t)$) according to a stationary power
control and subband allocation policy defined below⁵.

Definition 1: (Stationary Power Control and Subband Allocation
Policy) A stationary transmit power and subband
allocation policy $\Omega = (\Omega_p, \Omega_s)$ is a mapping from the system
state $\chi$ to the power and subband allocation actions. A policy
$\Omega$ is called feasible if the associated actions satisfy the average
total transmit power constraint and the subband assignment
constraint. Specifically, $\Omega_p(\chi) = \mathbf{p} = \{p_{k,n} \geq 0 : \forall k, n\}$ and
$\Omega_s(\chi) = \mathbf{s} = \{s_{k,n} \in \{0, 1\} : \forall k, n\}$ satisfy

$$E\left[\sum_{n=1}^{N_F} p_{k,n}\right] \leq P_k, \quad \forall k \in \{1, \dots, K\} \tag{3}$$

$$\sum_{k=1}^{K} s_{k,n} = 1, \quad \forall n \in \{1, \dots, N_F\} \tag{4}$$

Furthermore, $\Omega$ also satisfies an average packet drop rate
constraint for each queue as follows:

$$\Pr[Q_k = N_Q] \leq P_k^d, \quad \forall k \in \{1, \dots, K\} \tag{5}$$

■

³ In this paper, our derived results are based on $\xi = 1$ for notational
simplicity, which can be easily extended to other cases.

⁴ The quasi-static assumption is realistic for pedestrian mobility users where
the channel coherence time is around 50 ms but the typical frame duration is
less than 5 ms in next generation wireless systems such as WiMAX. On the other
hand, we assume the CSI is i.i.d. between slots (as in much of the literature)
in order to capture first order insights. A similar solution framework can also
be extended to deal with correlated fading.

⁵ At the beginning of the $t$-th scheduling slot, the cross-layer controller
observes the joint CSI $\mathbf{H}(t)$ and the joint QSI $\mathbf{Q}(t)$ and
determines the transmit power and subband allocation across the $K$ users.


■ 
From (2), the vector queue dynamics is Markovian with the
transition probability given by

$$\Pr[\mathbf{Q}(t+1) \mid \chi(t), \Omega(\chi(t))] = \Pr\left[\mathbf{A}(t) = \mathbf{Q}(t+1) - [\mathbf{Q}(t) - \mathbf{R}(t)\tau]^+\right] = \prod_k \Pr\left[A_k(t) = Q_k(t+1) - [Q_k(t) - R_k(t)\tau]^+\right] \tag{6}$$

Note that the $K$ queues are coupled together via the control
policy $\Omega$ and the constraint in (4). From Assumption 1, the
induced random process $\chi(t) = (\mathbf{H}(t), \mathbf{Q}(t))$ is Markovian
with the following transition probability⁶:

$$\Pr[\chi(t+1) \mid \chi(t), \Omega(\chi(t))] = \Pr[\mathbf{H}(t+1) \mid \chi(t), \Omega(\chi(t))] \Pr[\mathbf{Q}(t+1) \mid \chi(t), \Omega(\chi(t))] = \Pr[\mathbf{H}(t+1)] \Pr[\mathbf{Q}(t+1) \mid \chi(t), \Omega(\chi(t))] \tag{7}$$

where $\Pr[\mathbf{Q}(t+1) \mid \chi(t), \Omega(\chi(t))]$ is given by (6). Given a
unichain policy $\Omega$, the induced Markov chain $\{\chi(t)\}$ is ergodic
and there exists a unique steady state distribution $\pi_\chi$ where
$\pi_\chi(\chi) = \lim_{t \to \infty} \Pr[\chi(t) = \chi]$. The average utility
of the $k$-th user under a unichain policy $\Omega$ is given by:

$$\overline{T}_k(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E[f(Q_k(t))] = E_{\pi_\chi}[f(Q_k)], \quad \forall k \in \{1, \dots, K\} \tag{8}$$

where $f(Q_k)$ is a monotonic increasing function of $Q_k$ and
$E_{\pi_\chi}$ denotes expectation w.r.t. the underlying measure $\pi_\chi$. For
example, when $f(Q_k) = Q_k / \lambda_k$, $\overline{T}_k(\Omega) = \frac{1}{\lambda_k} E_{\pi_\chi}[Q_k]$
is the average delay of the $k$-th user. Another interesting example
is the queue outage probability $\overline{T}_k(\Omega) = \Pr[Q_k \geq Q_k^o]$, in
which $f(Q_k) = \mathbf{1}[Q_k \geq Q_k^o]$, where $Q_k^o \in \{0, \dots, N_Q\}$ is the
reference outage queue state. Similarly, the average transmit
power constraint in (3) and the packet drop constraint in (5)
can be written as

$$\overline{P}_k(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E\left[\sum_n p_{k,n}(t)\right] = E_{\pi_\chi}\left[\sum_n p_{k,n}\right] \leq P_k, \quad \forall k \in \{1, \dots, K\} \tag{9}$$

$$\overline{P}_k^d(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E\left[\mathbf{1}[Q_k(t) = N_Q]\right] = E_{\pi_\chi}\left[\mathbf{1}[Q_k = N_Q]\right] \leq P_k^d, \quad \forall k \in \{1, \dots, K\} \tag{10}$$

⁶ Although the QSI $\mathbf{Q}(t+1)$ and CSI $\mathbf{H}(t)$ are correlated via the
control action $\Omega(\chi(t))$, due to the i.i.d. assumption of CSI in
Assumption 1, $\mathbf{H}(t+1)$ is independent of $\chi(t)$. Note that $\mathbf{H}(t)$
being i.i.d. is a special case of the Markovian model. Hence, (7) holds under
the i.i.d. assumption on $\mathbf{H}(t)$ in Assumption 1.



III. CMDP Formulation and General Solution of
the Delay-Optimal Problem

In this section, we shall formulate the delay-optimal prob- 
lem as an infinite horizon average reward constrained Markov 
Decision Problem (CMDP) and discuss the general solution. 

A. CMDP Formulation 

An MDP can be characterized by a tuple of four objects,
namely the state space, the action space, the transition probability
kernel as well as the per-stage reward function. In
our delay-optimization problem, we could associate these four
objects as follows:

• State Space: The state space for the MDP is given by
$\{\chi^1, \dots, \chi^{I_\chi}\}$, where $\chi^i = (\mathbf{H}^i, \mathbf{Q}^i)$
($1 \leq i \leq I_\chi$) is a realization of the global system state.

• Action Space: The action space of the MDP is given by
$\{\Omega(\chi^1), \dots, \Omega(\chi^{I_\chi})\}$, where $\Omega$ is a unichain
feasible policy as defined in Definition 1.

• Transition Kernel: The transition kernel of the MDP
$\Pr[\chi^j \mid \chi^i, \Omega(\chi^i)]$ is given by (7).

• Per-stage Reward: The per-stage reward function of the
MDP is given by $d(\chi, \Omega(\chi)) = \sum_k \beta_k f(Q_k)$.

As a result, the delay-optimal control can be formulated as
a CMDP, which is summarized below.

Problem 1 (Delay-Optimal Constrained MDP): For some
positive constants $\beta = (\beta_1, \dots, \beta_K)$⁷, the delay-optimal
problem is formulated as

$$\min_{\Omega} J_\beta(\Omega) = \sum_{k=1}^{K} \beta_k \overline{T}_k(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E\left[d(\chi(t), \Omega(\chi(t)))\right] \tag{11}$$

s.t. the power and packet drop rate constraints in (9), (10).

B. Lagrangian Approach to the CMDP 

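For any LMs $\overline{\gamma}^k, \underline{\gamma}^k > 0$, define the Lagrangian
as $L_\beta(\Omega, \gamma) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E\left[g(\gamma, \chi(t), \Omega(\chi(t)))\right]$, where
$\gamma = (\gamma^1, \dots, \gamma^K)$ with $\gamma^k = (\overline{\gamma}^k, \underline{\gamma}^k)$, and
$g(\gamma, \chi, \Omega(\chi)) = \sum_k \left( \beta_k f(Q_k) + \overline{\gamma}^k \left(\sum_n p_{k,n} - P_k\right) + \underline{\gamma}^k \left(\mathbf{1}[Q_k = N_Q] - P_k^d\right) \right)$.
Thus, the corresponding unconstrained MDP for a particular
LM $\gamma$ is given by

$$G(\gamma) = \min_{\Omega} L_\beta(\Omega, \gamma) \tag{12}$$

where $G(\gamma)$ gives the Lagrange dual function. The dual
problem of the primal problem in Problem 1 is given by
$\max_{\gamma \succeq 0} G(\gamma)$. The general solution to the unconstrained MDP
in (12) is summarized in the following lemma.

⁷ The positive weighting factors $\beta$ in (11) indicate the relative importance
of buffer delay among the $K$ data streams and, for each given $\beta$, the solution
to (11) corresponds to a point on the Pareto optimal delay tradeoff boundary
of a multi-objective optimization problem.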
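
For concreteness, the per-stage Lagrangian cost $g$ defined above (weighted delay cost plus the two multiplier-weighted constraint violations per user) can be evaluated as in the following sketch. The function name `per_stage_cost`, the argument names, the default choice $f(Q_k) = Q_k$, and all numerical values are assumptions made for illustration.

```python
def per_stage_cost(beta, gamma_pow, gamma_drop, queues, powers,
                   p_avg, p_drop, n_q, f=lambda q: q):
    """Per-stage Lagrangian cost summed over users.

    beta       : delay weights beta_k
    gamma_pow  : multipliers for the average power constraints
    gamma_drop : multipliers for the packet drop constraints
    queues     : current queue lengths Q_k
    powers     : powers[k][n] is the power of user k on subband n
    p_avg      : average power budgets P_k
    p_drop     : packet drop rate targets P_k^d
    n_q        : buffer size N_Q
    f          : monotonic increasing utility of the queue length (assumed)
    """
    total = 0.0
    for k in range(len(queues)):
        delay_term = beta[k] * f(queues[k])
        power_term = gamma_pow[k] * (sum(powers[k]) - p_avg[k])
        full = 1.0 if queues[k] == n_q else 0.0   # indicator 1[Q_k = N_Q]
        drop_term = gamma_drop[k] * (full - p_drop[k])
        total += delay_term + power_term + drop_term
    return total


# Hypothetical two-user example; user 2 has a full buffer.
cost = per_stage_cost(
    beta=[1.0, 1.0], gamma_pow=[0.5, 0.5], gamma_drop=[2.0, 2.0],
    queues=[3, 10], powers=[[1.0, 1.0], [0.0, 4.0]],
    p_avg=[2.0, 3.0], p_drop=[0.1, 0.1], n_q=10,
)
print(cost)
```

In this example the second user both exceeds its power budget and sits at a full buffer, so both multiplier terms add to its cost, while the first user's drop term is slightly negative.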

Lemma 1: (Bellman Equation and Subband Allocation Q-factor)
For a given $\gamma$, the optimizing policy for the unconstrained
MDP in (12) can be obtained by solving the
Bellman equation (associated with the MDP in (12)) w.r.t.
$(\theta, \{\mathbb{Q}(\chi, \mathbf{s})\})$ as below:

$$\mathbb{Q}(\chi^i, \mathbf{s}) = \min_{\Omega_p(\chi^i)} \left[ g(\gamma, \chi^i, \mathbf{s}, \Omega_p(\chi^i)) + \sum_{\chi^j} \Pr[\chi^j \mid \chi^i, \mathbf{s}, \Omega_p(\chi^i)] \min_{\mathbf{s}'} \mathbb{Q}(\chi^j, \mathbf{s}') \right] - \theta, \quad \forall 1 \leq i \leq I_\chi,\ \forall \mathbf{s} \tag{13}$$

where $\theta = L_\beta^*(\gamma) = \min_\Omega L_\beta(\Omega, \gamma)$ is the optimal average
reward per stage and $\{\mathbb{Q}(\chi, \mathbf{s})\}$ is the subband allocation
Q-factor. The optimal control policy⁸ is given by $\Omega^* = (\Omega_p^*, \Omega_s^*)$
with $\Omega_p^*(\chi^i)$ attaining the minimum of the R.H.S. of (13) and
$\Omega_s^*(\chi^i) = \arg\min_{\mathbf{s}} \mathbb{Q}(\chi^i, \mathbf{s})$ for any $\chi^i$. Since the policy
space we consider consists of only unichain policies, the
associated Markov chain $\{\chi(t)\}$ is irreducible and there exists
a recurrent state⁹. Hence, the solution to (13) is unique up to
an additive constant [13]. ■

Proof: Please refer to Appendix A for the proof. ■
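
To make the structure of the Bellman equation in (13) concrete, the sketch below runs relative value iteration on Q-factors for a hypothetical two-state, two-action average-cost MDP. The cost table, the transition probabilities, the reference pair, and the function name `relative_q_iteration` are all invented for illustration, and this brute-force iteration is a generic textbook technique rather than the distributive algorithm developed in this paper.

```python
def relative_q_iteration(cost, trans, ref=(0, 0), iters=500):
    """Relative value iteration on Q-factors for an average-cost MDP.

    cost[x][a]     : per-stage cost of action a in state x
    trans[x][a][y] : probability of moving from x to y under action a
    ref            : reference (state, action) pair whose Q-factor is pinned to 0
    Returns (q, theta), where theta estimates the optimal average cost.
    """
    n_states, n_actions = len(cost), len(cost[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    theta = 0.0
    for _ in range(iters):
        v = [min(row) for row in q]  # min over actions in each next state
        t = [[cost[x][a] + sum(trans[x][a][y] * v[y] for y in range(n_states))
              for a in range(n_actions)] for x in range(n_states)]
        theta = t[ref[0]][ref[1]]    # offset that keeps the iterates bounded
        q = [[t[x][a] - theta for a in range(n_actions)]
             for x in range(n_states)]
    return q, theta


# Hypothetical toy problem (all transition probabilities are positive).
COST = [[1.0, 2.0], [3.0, 0.5]]
TRANS = [[[0.9, 0.1], [0.2, 0.8]],
         [[0.5, 0.5], [0.3, 0.7]]]

q_factor, avg_cost = relative_q_iteration(COST, TRANS)
greedy_policy = [min(range(2), key=lambda a: q_factor[x][a]) for x in range(2)]
print(avg_cost, greedy_policy)
```

The returned pair satisfies the same fixed-point form as (13): each Q-factor equals the per-stage cost plus the expected minimum next-state Q-factor, minus the average cost. The table has one entry per state-action pair, which is exactly what becomes unmanageable for the joint state space of the OFDMA problem.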

Using standard optimization theory [14], the problem in (12)
has an optimal solution for a particular choice of the LM
$\gamma = \gamma^*$, where $\gamma^*$ is chosen to satisfy the average power constraint
in (9) and the packet drop constraint in (10). Moreover, it is shown
in [15] that the following saddle point condition holds:

$$L(\Omega^*, \gamma) \leq L(\Omega^*, \gamma^*) \leq L(\Omega, \gamma^*) \tag{14}$$

In other words, $(\Omega^*, \gamma^*)$ is a saddle point of the Lagrangian;
then $\Omega^*$ is the primal optimal (i.e. solving Problem 1), $\gamma^*$ is
the dual optimal (solving the dual problem) and the duality
gap is zero. Therefore, by solving the dual problem, we can
obtain the primal optimal $\Omega^*$.

Remark 1: The optimal control actions are functions of the
subband allocation Q-factor $\{\mathbb{Q}(\chi, \mathbf{s})\}$ and the $2K$ LMs.
Unfortunately, for any given LMs, determining the subband
allocation Q-factor involves solving the Bellman equation in
(13), which is a fixed point problem over the functional space
with exponential complexity. In other words, it is a system of
$K^{N_F} I_\chi = K^{N_F} \left(N_H^{N_F}(N_Q + 1)\right)^K$ non-linear equations with
$K^{N_F} I_\chi + 1$ unknowns $(\theta, \{\mathbb{Q}(\chi, \mathbf{s})\})$. Furthermore, even if we
could solve it, the solution would be centralized and the joint
CSI and QSI knowledge would be required, which is highly
undesirable. ■



⁸ It is known that for a CMDP, the optimal policy may be a randomized policy.
However, for implementation considerations, we are interested in deterministic
policies in this paper.

⁹ For sufficiently large total transmit power $\{P_1, \dots, P_K\}$ so that the
optimization problem in (11) is feasible, the state $\chi = (\mathbf{H}, \mathbf{Q})$
($\forall \mathbf{H}$ and $\mathbf{Q} = (0, \dots, 0)$) is recurrent.



IV. General Decentralized Solution via 
Localized Stochastic Learning and Auction 

The key steps in obtaining the optimal control policies
from the R.H.S. of the Bellman equation in (13) rely on
the knowledge of the subband allocation Q-factor $\{\mathbb{Q}(\chi, \mathbf{s})\}$
and the $2K$ LMs $\{\overline{\gamma}^k, \underline{\gamma}^k\}$ ($1 \leq k \leq K$), which is very
challenging. A brute-force solution of $\{\mathbb{Q}(\chi, \mathbf{s})\}$ and the $2K$ LMs
has exponential complexity and requires centralized implementation
and knowledge of the joint CSI and QSI (which
also requires huge signaling overheads). In this section, we
shall approximate the subband allocation Q-factor $\mathbb{Q}(\chi, \mathbf{s})$ by
the sum of per-user subband allocation Q-factors $\mathbb{Q}^k(\chi_k, \mathbf{s}_k)$,
i.e. $\mathbb{Q}(\chi, \mathbf{s}) \approx \sum_k \mathbb{Q}^k(\chi_k, \mathbf{s}_k)$. Based on the approximate
Q-factor, we shall derive a per-stage decentralized control
policy using a per-stage auction. Next, we shall propose
a localized online stochastic learning algorithm (performed
locally at each MS $k$) to determine the per-user Q-factor
$\{\mathbb{Q}^k(\chi_k, \mathbf{s}_k)\}$ as well as the two local LMs $\gamma^k = (\overline{\gamma}^k, \underline{\gamma}^k)$
based on observations of the local CSI and local QSI as well as
the auction result. Furthermore, we shall prove that under the
proposed per-stage auction, the local online stochastic learning
algorithm converges almost surely (with probability 1).

A. Linear Approximation on the Subband Allocation Q-Factor 
and Distributive Power Control 

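Denote the per-user system state, channel state, subband allocation
actions and power control actions as $\chi_k = (Q_k, \mathbf{H}_k)$,
$\mathbf{H}_k = \{|H_{k,n}| : \forall n\}$, $\mathbf{s}_k = \{s_{k,n} : \forall n\}$ and
$\mathbf{p}_k = \{p_{k,n} : \forall n\}$, respectively. To reduce the size of the
state space and to decentralize the resource allocation, we approximate
$\mathbb{Q}(\chi, \mathbf{s})$ by the sum of per-user subband allocation Q-factors
$\mathbb{Q}^k(\chi_k, \mathbf{s}_k)$, i.e.

$$\mathbb{Q}(\chi, \mathbf{s}) \approx \sum_k \mathbb{Q}^k(\chi_k, \mathbf{s}_k) \tag{15}$$

where $\mathbb{Q}^k(\chi_k, \mathbf{s}_k)$ satisfies the following per-user subband
allocation Q-factor fixed point equation for each MS $k$:

$$\mathbb{Q}^k(\chi_k^i, \mathbf{s}_k) = \min_{\mathbf{p}_k} \left[ g_k(\gamma^k, \chi_k^i, \mathbf{s}_k, \mathbf{p}_k) + \sum_{\chi_k^j} \Pr[\chi_k^j \mid \chi_k^i, \mathbf{s}_k, \mathbf{p}_k] W^k(\chi_k^j) \right], \quad \forall 1 \leq i \leq I_{\chi_k},\ \forall \mathbf{s}_k \tag{16}$$

where $g_k(\gamma^k, \chi_k, \mathbf{s}_k, \mathbf{p}_k) = \beta_k f(Q_k) + \overline{\gamma}^k \left(\sum_n p_{k,n} - P_k\right) + \underline{\gamma}^k \left(\mathbf{1}[Q_k = N_Q] - P_k^d\right)$
and $W^k(\chi_k) = E\left[\mathbb{Q}^k\left(\chi_k, \{s_{k,n} = \mathbf{1}[|H_{k,n}| > H_{K-1}^*]\}\right) \mid \chi_k\right]$
($H_{K-1}^*$ denotes the largest order statistic of the $(K - 1)$ i.i.d. random
variables with the same distribution as $|H_{k,n}|$), and
$I_{\chi_k} = N_H^{N_F}(N_Q + 1)$ is the cardinality of the space of the per-user
system state. Note that under the subband allocation Q-factor approximation,
the state space of $K$ users is significantly reduced from
$I_\chi = \left(N_H^{N_F}(N_Q + 1)\right)^K$ to $K I_{\chi_k} = K N_H^{N_F}(N_Q + 1)$.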
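
The size of this reduction can be checked numerically with the example from the Introduction (4 users, 6 independent subbands, buffer size 50, 4 channel states). The sketch below is illustrative only; the function names are hypothetical, and Python's arbitrary-precision integers are used so that the joint count is exact.

```python
def joint_state_count(n_h, n_f, n_q, k):
    """Cardinality of the joint state space: (N_H^N_F * (N_Q + 1))^K."""
    return (n_h ** n_f * (n_q + 1)) ** k


def per_user_state_count(n_h, n_f, n_q, k):
    """Total size of the K per-user state spaces: K * N_H^N_F * (N_Q + 1)."""
    return k * n_h ** n_f * (n_q + 1)


# Example values taken from the Introduction.
joint = joint_state_count(n_h=4, n_f=6, n_q=50, k=4)
per_user = per_user_state_count(n_h=4, n_f=6, n_q=50, k=4)
print(joint)      # roughly 1.9e21 joint states
print(per_user)   # 835584 per-user states in total
```

The joint count grows exponentially in $K$, while the per-user total grows only linearly in $K$, which is what makes storing one Q-factor table per mobile feasible.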

B. Per-Stage Subband Auction 

The subband allocation control can be obtained by minimizing
the original subband allocation Q-factor in (13)
over subband allocation actions. Using the approximate
Q-factor, the subband allocation control is given by
$\Omega_s^*(\chi) = \arg\min_{\mathbf{s}} \mathbb{Q}(\chi, \mathbf{s}) \approx \arg\min_{\mathbf{s}} \sum_k \mathbb{Q}^k(\chi_k, \mathbf{s}_k)$.
This can be obtained via a per-stage subband auction with $K$ bidders (MSs)
and one auctioneer (BS) based on the observed realization
of the system state $\chi_k$ at each MS. The Per-Stage Subband
Auction among the $K$ MSs is as follows:

• Bidding: Based on the local observation $\chi_k$, each user $k$
submits its bid $\{\mathbb{Q}^k(\chi_k, \mathbf{s}_k) : \forall \mathbf{s}_k\}$.

• Subband Allocation: The BS assigns subbands to
achieve the minimum sum of bids, i.e.

$$\mathbf{s}^* = \Omega_s^*(\chi) = \arg\min_{\mathbf{s}} \sum_k \mathbb{Q}^k(\chi_k, \mathbf{s}_k) \tag{17}$$

and then broadcasts the allocation results $\mathbf{s}^* = \{\mathbf{s}_k^* : \forall k\}$
to the $K$ users.

• Power Allocation: Based on the subband allocation result
$\mathbf{s}_k^*$, each user $k$ determines the transmit power which
minimizes the R.H.S. of (16), i.e.

$$\mathbf{p}_k^* = \Omega_p^*(\chi_k^i) = \arg\min_{\mathbf{p}_k} \left[ g_k(\gamma^k, \chi_k^i, \mathbf{s}_k^*, \mathbf{p}_k) + \sum_{\chi_k^j} \Pr[\chi_k^j \mid \chi_k^i, \mathbf{s}_k^*, \mathbf{p}_k] W^k(\chi_k^j) \right] \tag{18}$$
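
The bidding and subband allocation steps above can be sketched as follows for a very small system. This is a brute-force illustration under stated assumptions: each subband is assigned to exactly one user, each user's bid is represented as a dictionary from its own 0/1 allocation vector to a Q-factor value, and the function name `subband_auction` and the bid values are hypothetical. Exhaustive search over all $K^{N_F}$ assignments is used only for clarity and does not reflect the complexity of the proposed scheme.

```python
from itertools import product


def subband_auction(bids, n_users, n_subbands):
    """Return the assignment minimizing the sum of per-user bids.

    bids[k] maps a tuple s_k of 0/1 flags (one per subband) to the
    per-user Q-factor value that user k submits for that allocation.
    The result is (owners, total), where owners[n] is the user that
    receives subband n.
    """
    best_total, best_owners = float("inf"), None
    for owners in product(range(n_users), repeat=n_subbands):
        total = 0.0
        for k in range(n_users):
            s_k = tuple(1 if owners[n] == k else 0 for n in range(n_subbands))
            total += bids[k][s_k]
        if total < best_total:
            best_total, best_owners = total, owners
    return best_owners, best_total


# Hypothetical bids for K = 2 users and N_F = 2 subbands.
bids = [
    {(0, 0): 5.0, (1, 0): 3.0, (0, 1): 4.0, (1, 1): 2.5},   # user 0
    {(0, 0): 6.0, (1, 0): 5.5, (0, 1): 3.0, (1, 1): 2.8},   # user 1
]
owners, total = subband_auction(bids, n_users=2, n_subbands=2)
print(owners, total)
```

In this toy instance the lowest total is obtained by giving the first subband to user 0 and the second to user 1, even though each user individually would prefer to hold both subbands.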



Remark 2: (Optimal Subband and Power Allocation under
Q-factor Approximation) In the proposed per-stage subband auction,
the subband allocation actions minimize $\sum_k \mathbb{Q}^k(\chi_k, \mathbf{s}_k)$,
and the power allocation actions at each MS minimize the
R.H.S. of the per-user subband allocation Q-factor fixed point
equation in (16). Therefore, the proposed per-stage subband
auction achieves the solution of the Bellman equation in (13)
under the linear Q-factor approximation in (15). ■

Remark 3: (Computational Complexity and Memory Requirement
Reduction at BS) With the per-stage subband auction
mechanism, the BS does not need to store the per-user
subband allocation Q-factor $\{\mathbb{Q}^k(\chi_k, \mathbf{s}_k)\}$ ($\forall k$) and the $2K$
LMs for all the MSs, which greatly reduces the memory
requirement at the BS. On the other hand, the BS does not need
to perform power allocation for each MS on each subband
$p_{k,n}$ ($\forall k, n$), which significantly reduces the computational
complexity at the BS. ■



C. Online Per-user Primal-Dual Learning Algorithm via
Stochastic Approximation

Since the derived power and subband allocation policies are
all functions of the per-user subband allocation Q-factor and
LMs, we shall propose an online localized learning algorithm
to estimate $\{\mathbb{Q}^k(\chi_k, \mathbf{s}_k)\}$ and the LMs $\gamma^k$ at each MS $k$. For
notational convenience, we denote the per-user state-action
combination as $\varphi = (\chi_k, \mathbf{s}_k)$ ($\forall k$). Let $i$ and $j$
($1 \leq i, j \leq I_\varphi$) be the dummy indices enumerating all the per-user
state-action combinations of each user with cardinality
$I_\varphi = 2^{N_F} I_{\chi_k}$. Let
$\mathbb{Q}^k = \left(\mathbb{Q}^k(\varphi^1), \dots, \mathbb{Q}^k(\varphi^{I_\varphi})\right)^T$ be the vector of
per-user Q-factors for user $k$. Let $\varphi_k(t) = (\chi_k(t), \mathbf{s}_k(t))$ be the
state-action pair observed at MS $k$ at the $t$-th slot, where
$\chi_k(t) = (Q_k(t), \mathbf{H}_k(t))$ is the system state realization observed at MS
$k$. Based on the current observation $\varphi_k(t)$, user $k$ updates its
estimate of the per-user Q-factor and the LMs according to:

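$$\mathbb{Q}_{t+1}^k(\varphi^i) = \mathbb{Q}_t^k(\varphi^i) + \epsilon^q_{l_k(\varphi^i, t)} \Big[ \big( g_k(\gamma_t^k, \varphi^i, \mathbf{p}_k(t)) + W_t^k(Q_k(t+1)) \big) - \big( g_k(\gamma_t^k, \varphi^I, \mathbf{p}_k(\bar{t})) + W_t^k(Q_k(\bar{t}+1)) - \mathbb{Q}_t^k(\varphi^I) \big) - \mathbb{Q}_t^k(\varphi^i) \Big] \cdot \mathbf{1}[\varphi_k(t) = \varphi^i] \tag{19}$$

$$\overline{\gamma}_{t+1}^k = \Gamma\left( \overline{\gamma}_t^k + \epsilon_t^\gamma \left( \sum_n p_{k,n}(t) - P_k \right) \right) \tag{20}$$

$$\underline{\gamma}_{t+1}^k = \Gamma\left( \underline{\gamma}_t^k + \epsilon_t^\gamma \left( \mathbf{1}[Q_k(t) = N_Q] - P_k^d \right) \right) \tag{21}$$

where $l_k(\varphi^i, t) = \sum_{m=0}^{t} \mathbf{1}[\varphi_k(m) = \varphi^i]$ is the number
of updates of $\mathbb{Q}^k(\varphi^i)$ till $t$ [16], $\mathbf{p}_k(t) = \{p_{k,n}(t) : \forall n\}$
is the power allocation action given by the per-stage auction,
$W_t^k(Q_k) \triangleq E[W_t^k(\chi_k) \mid Q_k]$ with
$W_t^k(\chi_k) = E\left[\mathbb{Q}_t^k\left(\chi_k, \{s_{k,n} = \mathbf{1}[|H_{k,n}| > H_{K-1}^*]\}\right) \mid \chi_k\right]$,
$\bar{t} \triangleq \sup\{t : \varphi_k(t) = \varphi^I\}$, $\varphi^I$ is the reference per-user
state-action combination¹⁰, $\Gamma(\cdot)$ is the projection onto an interval
$[0, B]$ for some $B > 0$, and $\{\epsilon_t^q\}$, $\{\epsilon_t^\gamma\}$ are the step size
sequences satisfying the following conditions:

$$\sum_t \epsilon_t^q = \infty,\ \epsilon_t^q > 0,\ \epsilon_t^q \to 0,\quad \sum_t \epsilon_t^\gamma = \infty,\ \epsilon_t^\gamma > 0,\ \epsilon_t^\gamma \to 0,\quad \sum_t \left( (\epsilon_t^q)^2 + (\epsilon_t^\gamma)^2 \right) < \infty,\quad \frac{\epsilon_t^\gamma}{\epsilon_t^q} \to 0 \tag{22}$$

The above distributive per-user potential learning algorithm
requires knowledge of the local QSI and local CSI only.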
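
A minimal sketch of the projected multiplier updates in (20) and (21) is given below, together with one pair of step size sequences that satisfies the conditions in (22). The specific choices $\epsilon_t^q = t^{-0.7}$ and $\epsilon_t^\gamma = 1/t$, the projection bound, and all function and argument names are assumptions made for illustration; any sequences meeting (22) could be used instead.

```python
def project(x, upper):
    """Projection onto the interval [0, upper]."""
    return min(max(x, 0.0), upper)


def step_sizes(t):
    """Example step sizes for slot t >= 1 (an assumed choice).

    eps_q = t^-0.7 and eps_gamma = 1/t both sum to infinity, have
    summable squares, and eps_gamma / eps_q = t^-0.3 tends to 0.
    """
    return t ** -0.7, 1.0 / t


def update_multipliers(gamma_pow, gamma_drop, t, powers, queue,
                       p_avg, p_drop, n_q, upper=100.0):
    """One projected update of the two local multipliers of a user.

    powers : powers used by this user on each subband in slot t
    queue  : queue length Q_k(t); the buffer is full when queue == n_q
    """
    _, eps_gamma = step_sizes(t)
    full = 1.0 if queue == n_q else 0.0
    new_pow = project(gamma_pow + eps_gamma * (sum(powers) - p_avg), upper)
    new_drop = project(gamma_drop + eps_gamma * (full - p_drop), upper)
    return new_pow, new_drop


# Hypothetical slot: power budget exceeded, buffer not full.
new_pow, new_drop = update_multipliers(
    gamma_pow=1.0, gamma_drop=0.0, t=2, powers=[1.0, 2.0],
    queue=5, p_avg=2.0, p_drop=0.1, n_q=10,
)
print(new_pow, new_drop)
```

The power multiplier rises because the instantaneous power exceeds the budget, while the drop multiplier would become negative and is therefore held at zero by the projection. Because the multiplier step size decays faster than the Q-factor step size, the multipliers move on the slower of the two time scales.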

Remark 4 (Comparison to the Deterministic NUM): In
conventional iterative solutions for deterministic NUM [12],
the iterative updates (with message exchange¹¹) are performed
within the CSI coherence time and hence, this limits the
number of iterations and the performance. However, in the
proposed online algorithm, the updates evolve on the same
time scale as the CSI and QSI. Hence, it could converge to a
better solution because the number of iterations is no longer
limited by the coherence time of the CSI. ■

Remark 5: (Comparison to Conventional Reinforcement
Learning) There are two key novelties in the proposed per-user
online update algorithms. Firstly, most of the existing literature
on online learning addresses unconstrained MDPs only
[9]. In the case of a CMDP, the LMs are determined offline by
simulation [17]. In our case, both the LMs and the per-user
Q-factor are updated simultaneously. Secondly, conventional
online learning is designed for centralized solutions where
the control actions are determined entirely from the potential
or Q-factor update. However, in our case, the control actions
for user $k$ are determined from $\{\mathbb{Q}^k(\varphi)\}$ ($\forall k$) via a per-stage
auction. During the iterative updates, both the per-user
Q-factor/LMs as well as the control actions are changed
dynamically, and the existing convergence results (based on
the contraction mapping argument) cannot be applied directly to
our distributive stochastic learning algorithm. ■

¹⁰ Without loss of generality, we initialize the per-user subband allocation
Q-factor as 0, i.e. $\mathbb{Q}_0^k(\varphi^i) = 0$ $\forall k$.

¹¹ Since the iterations within a CSI coherence time involve explicit message
passing, there is processing and signaling overhead per iteration and this limits
the total number of iterations within a CSI coherence time.



D. Convergence Analysis 

In this section, we shall establish technical conditions
for the almost-sure convergence of the online distributive
learning algorithm. For any LM $\gamma$ ($\gamma^k \geq 0$), define a
vector mapping $\mathbf{T}^k : \mathbb{R}^2 \times \mathbb{R}^{I_\varphi} \to \mathbb{R}^{I_\varphi}$ for user $k$,
with $\mathbf{T}^k = (T_1^k, \dots, T_{I_\varphi}^k)^T$ and the $i$-th ($1 \leq i \leq I_\varphi$)
component mapping defined as

$$T_i^k(\gamma^k, \mathbb{Q}^k) = \min_{\mathbf{p}_k} \left[ g_k(\gamma^k, \varphi^i, \mathbf{p}_k) + \sum_{\varphi^j} \Pr[\varphi^j \mid \varphi^i, \mathbf{p}_k] \mathbb{Q}^k(\varphi^j) \right]$$

Here, $\mathbf{P}_t^k$ is the $I_\varphi \times I_\varphi$ transition probability matrix with
$\Pr[\varphi^j \mid \varphi^i, \mathbf{p}_t^k(\varphi^i)]$ as its $(i, j)$-element, where
$\mathbf{p}_t^k(\varphi^i)$ denotes the power allocation for $\varphi^i$ obtained by the
per-stage subband auction at the $t$-th iteration, and $\mathbf{I}$ is the
$I_\varphi \times I_\varphi$ identity matrix.

Since we have two different step size sequences $\{\epsilon_t^q\}$ and
$\{\epsilon_t^\gamma\}$ with $\epsilon_t^\gamma = o(\epsilon_t^q)$, the LM updates and the per-user
Q-factor updates are done simultaneously but over two different
time scales. During the per-user Q-factor update (timescale I),
we have $\overline{\gamma}_{t+1}^k - \overline{\gamma}_t^k = e(t)$ and
$\underline{\gamma}_{t+1}^k - \underline{\gamma}_t^k = e(t)$ ($\forall k$), where
$e(t) = O(\epsilon_t^\gamma) = o(\epsilon_t^q)$. Therefore, the LMs appear to be
quasi-static [18] during the per-user Q-factor update in (19). We first
have the following lemma.

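Lemma 2: (Convergence of Per-user Q-factor Learning
over Timescale I) Assume that for all the feasible policies $\Omega$ in
the policy space, there exists a $\delta_m = O(\epsilon_t^q) > 0$ and some
positive integer $m$ such that

$$[\mathbf{A}_m^k \cdots \mathbf{A}_1^k]_{ir} \geq \delta_m, \quad [\mathbf{B}_m^k \cdots \mathbf{B}_1^k]_{ir} \geq \delta_m, \quad 1 \leq i \leq I_\varphi \tag{24}$$

where $[\cdot]_{ir}$ denotes the element in the $i$-th row and $r$-th
column of the corresponding $I_\varphi \times I_\varphi$ matrix ($r$ is the column
index in $\mathbf{P}_t^k$ which contains the aggregate reference state
$\varphi^I$). For stepsize sequences $\{\epsilon_t^q\}$, $\{\epsilon_t^\gamma\}$ satisfying the
conditions in (22), we have $\lim_{t \to \infty} \mathbb{Q}_t^k = \mathbb{Q}_\infty^k(\gamma)$ a.s.
for any initial per-user subband allocation Q-factor vector $\mathbb{Q}_0^k$ and LM
$\gamma$, where the converged per-user subband allocation Q-factor
$\mathbb{Q}_\infty^k(\gamma)$ satisfies:

$$\left( T_r^k(\gamma^k, \mathbb{Q}_\infty^k(\gamma)) - \mathbb{Q}_\infty^k(\varphi^I) \right) \mathbf{e} + \mathbb{Q}_\infty^k(\gamma) = \mathbf{T}^k(\gamma^k, \mathbb{Q}_\infty^k(\gamma)) \tag{25}$$

Proof: Please refer to Appendix B. ■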

On the other hand, during the LM update (timescale II), we
have $\lim_{t \to \infty} \|\mathbb{Q}_t^k - \mathbb{Q}_\infty^k(\gamma_t)\| = 0$ w.p.1 by
Corollary 2.1 of [19]. Hence, during the LM updates in (20) and (21), the
per-user subband allocation Q-factor update is seen as almost
equilibrated. The convergence of the LMs is summarized below.

Lemma 3 (Convergence of the LM over Timescale II):
The iterates satisfy $\lim_{t \to \infty} \gamma_t = \gamma_\infty$ a.s., where $\gamma_\infty$
satisfies the power and packet drop rate constraints in (9) and (10). ■

Proof: Please refer to Appendix C. ■

Based on the above lemmas, we shall summarize the con- 
vergence performance of the online per-user Q-factor and LM 
learning algorithm in the following theorem. 



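Theorem 1 (Convergence of Online Per-user Learning Algorithm):
For the same conditions as in Lemma 2, we have
$(\mathbb{Q}_t^k, \gamma_t^k) \to (\mathbb{Q}_\infty^k, \gamma_\infty^k)$ a.s. $\forall k$, where
$\mathbb{Q}_\infty^k(\gamma_\infty)$ and $\gamma_\infty$ satisfy

$$\left( T_r^k(\gamma_\infty^k, \mathbb{Q}_\infty^k) - \mathbb{Q}_\infty^k(\varphi^I) \right) \mathbf{e} + \mathbb{Q}_\infty^k = \mathbf{T}^k(\gamma_\infty^k, \mathbb{Q}_\infty^k) \tag{26}$$

and $\gamma_\infty$ satisfies the power and packet drop rate constraints
in (9) and (10). ■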



V. Application to the OFDMA systems with 
Exponential Packet Size Distribution 

In this section, we shall illustrate the application of the 
proposed stochastic learning algorithm by an example with 
exponential packet size distribution. 



A. Dynamics of System State under Exponentially Distributed Packet Size

Let $\mathbf{A}(t) = (A_1(t), \cdots, A_K(t))$ and $\mathbf{N}(t) = (N_1(t), \cdots, N_K(t))$ be the random new packet arrivals and the packet sizes for the $K$ users at the $t$-th scheduling slot, respectively. $\mathbf{Q}(t) = (Q_1(t), \cdots, Q_K(t))$ and $N_Q$ denote the joint QSI (number of packets) at the end of the $t$-th scheduling slot and the maximum buffer size (number of packets).

Assumption 3: The arrival process $A_k(t)$ is i.i.d. over scheduling slots according to a general distribution $\Pr(A_k)$ with average arrival rate $\mathbb{E}[A_k] = \lambda_k$. The random packet size $N_k(t)$ is i.i.d. over scheduling slots following an exponential distribution with mean packet size $\bar{N}_k$. ■

Given a stationary policy, define the conditional mean departure rate of packets of user $k$ at the $t$-th slot (conditioned on $\chi(t)$) as $\mu_k(\chi(t)) = R_k(\chi(t))/\bar{N}_k$.

Assumption 4: The slot duration $\tau$ is sufficiently small compared with the average packet service time, i.e. $\mu_k(\chi(t))\tau \ll 1$. ■
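The numbers quoted in the footnote to Assumption 4 can be checked with a few lines of arithmetic. The sketch below only reproduces that calculation; reading the coding rate as 1/2 (so that 64QAM carries 3 payload bits per symbol) and treating "around 10K bits" as exactly 10,000 bits are assumptions, as are the variable names.

```python
import math

symbols_per_block = 8 * 16 - 12      # 8 x 16 symbols minus 12 pilot symbols
bits_per_symbol = 6 * 0.5            # 64QAM (6 bits) with an assumed rate-1/2 code
payload_bits = symbols_per_block * bits_per_symbol
packet_bits = 10_000                 # assumed size of one MPEG4 packet in bits
blocks_per_packet = math.ceil(packet_bits / payload_bits)
```

Under these assumptions a single packet needs several tens of minimum resource blocks, which is consistent with a packet being served over many scheduling slots when the uplink is shared.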

Given the current system state $\chi(t)$ and the control action, and conditioned on the packet arrival $\mathbf{A}(t)$ at the end of the $t$-th slot, there will be a packet departure of the $k$-th user at the $(t+1)$-th slot if the remaining service time of a packet is less than the current slot duration $\tau$. By the memoryless property of the exponential distribution, the remaining packet length (also denoted as $\mathbf{N}(t)$) at any slot $t$ is also exponentially distributed. Hence, the transition probability to $Q_k(t+1)$ at the $(t+1)$-th slot corresponding to a packet departure event

^^This assumption is reasonable in practical systems. For instance, in the UL WiMAX (with multiple UL users served simultaneously), the minimum resource block that could be allocated to a user in the UL is 8 × 16 symbols − 12 pilot symbols = 116 symbols. Even with 64QAM and rate 1/2 coding, the number of payload bits it can carry is 116 × 3 bits = 348 bits. As a result, when there are a lot of UL users sharing the WiMAX AP, there could be cases where the MPEG4 packet (around 10K bits) from a UL user cannot be delivered in one frame. In addition, the delay requirement of MPEG4 is 500 ms or more, while the frame duration of WiMAX is 5 ms. Hence, it is not necessary to serve one packet during one scheduling slot, so that the scheduler has more flexibility in allocating resources. Therefore, in practical systems, an application level packet may have a mean packet length spanning many time slots (frames), and this assumption is also adopted in [20]-[23].
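is given by

$$\Pr\big[Q_k(t+1) = A_k(t) + Q_k(t) - 1 \,\big|\, \chi(t), \mathbf{A}(t), \Omega(\chi(t))\big] = \Pr\Big[\frac{N_k(t)}{R_k(t)} < \tau \,\Big|\, \chi(t), \mathbf{A}(t), \Omega(\chi(t))\Big] = 1 - \exp(-\mu_k(\chi(t))\tau) \approx \mu_k(\chi(t))\tau \quad (27)$$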



where the last equality is due to Assumption 4. Note that the probability for simultaneous departure of two or more packets from the same queue or different queues in a slot is $O((\mu_k(\chi(t))\tau)^2)$, which is asymptotically negligible. Therefore, the vector queue dynamics is Markovian with the transition probability given by

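$$\Pr[\mathbf{Q}(t+1) \,|\, \chi(t), \Omega(\chi(t))] = \sum_k \Pr[\mathbf{A}(t) = \mathbf{Q}(t+1) - \mathbf{Q}(t) + \mathbf{e}_k]\,\mu_k(\chi(t))\tau + \Pr[\mathbf{A}(t) = \mathbf{Q}(t+1) - \mathbf{Q}(t)]\Big(1 - \sum_k \mu_k(\chi(t))\tau\Big) \quad (28)$$

where $\mathbf{e}_k$ denotes the standard basis vector with 1 for its $k$-th component and 0 for every other component.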
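For a single user, the transition law above reduces to "one departure with probability $\mu_k\tau$, no departure otherwise", mixed over the arrival distribution. A minimal sketch follows; the function name, the dictionary representation of the arrival distribution, the clipping of arrivals at the buffer size and the treatment of an empty queue are assumptions made for illustration.

```python
def queue_transition_pmf(q, arrival_pmf, mu_tau, buffer_size):
    """Return {next_queue_length: probability} for one user and one slot.

    q           : current number of packets in the queue
    arrival_pmf : dict {number_of_arrivals: probability}
    mu_tau      : mu_k(chi) * tau, the per-slot departure probability (<< 1)
    buffer_size : N_Q; arrivals that do not fit are dropped (assumption)
    """
    pmf = {}
    for a, pr_a in arrival_pmf.items():
        after_arrival = min(q + a, buffer_size)      # overflow packets are dropped
        after_departure = max(after_arrival - 1, 0)  # at most one departure per slot
        p_leave = mu_tau if after_arrival > 0 else 0.0  # nothing to serve if empty
        pmf[after_arrival] = pmf.get(after_arrival, 0.0) + pr_a * (1.0 - p_leave)
        pmf[after_departure] = pmf.get(after_departure, 0.0) + pr_a * p_leave
    return pmf


pmf = queue_transition_pmf(q=3, arrival_pmf={0: 0.9, 1: 0.1}, mu_tau=0.05, buffer_size=10)
```

With these illustrative numbers the queue most likely stays at 3 packets, and moves to 2 or 4 packets with small probabilities that sum with it to one.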

B. Decomposition of the Per-user Subband Allocation Q-factor

In the following lemma, we shall show that the per-user subband allocation Q-factor $Q^k(\chi_k, \mathbf{s}_k)$ can be further decomposed into the sum of per-user per-subband Q-factors, which further simplifies the learning algorithm.

Lemma 4 (Decomposition of Per-user Q-factor): The per-user Q-factor $Q^k(\chi_k, \mathbf{s}_k)$ (defined by the fixed point equation in (16)) can be decomposed into the sum of the per-user per-subband Q-factors $\{q^k(Q, |H|, s)\}$, i.e. $Q^k(\chi_k, \mathbf{s}_k) = \sum_n q^k(Q_k, |H_{k,n}|, s_{k,n})$, where

J = min {^fe,n(7^, Qk, \Hk,n\,Sk,n,Pk,n) 

Pk,n 

NF8w\Qk)T 



2\QkAHk 



,Sk, 



m, 



-5/c,nlog(l+P/e,n|^/c,nP) 

■nw^{Qk^Ak)\Qk]-ir} (29) 
Furthermore, we have $\widetilde{W}^k(Q_k) = N_F W^k(Q_k)$. ■

Proof: Please refer to Appendix D for the proof. ■

Based on the per-user per-subband Q-factor $\{q^k(Q, |H|, s)\}$, we can obtain the closed-form power

^^ Since $N_k(t)$ is exponentially distributed and is memoryless, we have the probability in (27) (conditioned on the current state $\chi(t)$ and the associated action $\Omega(\chi(t))$) independent of the previous states $\{\chi(t-1), \chi(t-2), \cdots\}$.
[Figure 2 flowchart. Initialization: set t = 0; each user k = 1:K chooses an initial potential and LM. Online Policy Improvement: at the beginning of the t-th slot, the BS performs the per-subband auction to obtain the policy for the t-th slot (each user submits its bids $X_{k,n}$, the BS assigns the subbands $s_{k,n}$, and each user determines its transmit power $p_{k,n}$). Online Potential and LM Update: at the end of the t-th slot, each user k = 1:K updates its potential and its multipliers for the (t+1)-th slot.]

Fig. 2. Algorithm Flow of the Online Distributive Primal-Dual Value Iteration Algorithm with Per-stage Auction and Simultaneous Updates on Potential and Lagrange multipliers (LM). Note that t = {0, 1, 2, ...} is the scheduling slot index.



allocation actions minimizing the R.H.S. of the per-user subband allocation Q-factor fixed point equation in (16), which is summarized in the following lemma:

Lemma 5 (Decentralized Power Control Actions): Given subband allocation actions $\mathbf{s}_k$, the optimal power control actions of user $k$ under the linear approximation on subband allocation Q-factor in (15) are given by






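$$p_{k,n}(Q_k, |H_{k,n}|, s_{k,n}) = s_{k,n}\Big(\frac{\tau}{\bar{N}_k}\,\frac{N_F\,\delta W^k(Q_k)}{\overline{\gamma}^k} - \frac{1}{|H_{k,n}|^2}\Big)^+, \quad \forall n \quad (33)$$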



Proof: Please refer to Appendix E for the proof. ■ 

Remark 6 (Multi-level Water-filling Structure of the Power Control Action): The power control action in (33) of Lemma 5 is a function of both the CSI and the QSI (it depends on the QSI indirectly via $\delta W^k(Q_k)$, which is a function of $\{q^k(Q, |H|, s)\}$). It has the form of multi-level water-filling, where the power is allocated according to the CSI across subbands but the water-level is adaptive to the QSI. ■
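The multi-level water-filling structure can be sketched numerically. The code assumes the power expression has the form $p_{k,n} = s_{k,n}(\text{level} - 1/|H_{k,n}|^2)^+$, with a QSI-dependent water level such as $\tau N_F\,\delta W^k(Q_k)/(\bar{N}_k\overline{\gamma}^k)$; the function name and the numerical values are hypothetical.

```python
def water_filling_power(channel_gains, assigned, water_level):
    """p_n = s_n * (water_level - 1/|H_n|^2)^+ for each subband n.

    channel_gains : list of |H_{k,n}|^2 values (CSI)
    assigned      : list of s_{k,n} in {0, 1}
    water_level   : QSI-dependent level (assumed to grow with the queue length)
    """
    return [s * max(0.0, water_level - 1.0 / g)
            for g, s in zip(channel_gains, assigned)]


gains = [2.0, 0.5, 4.0]
alloc = [1, 1, 0]
low_queue = water_filling_power(gains, alloc, water_level=1.0)   # short queue
high_queue = water_filling_power(gains, alloc, water_level=3.0)  # long queue
```

For a fixed CSI realization, raising the water level (a longer queue in this reading) turns on the weaker subband and increases the power on the stronger one, while an unassigned subband always receives zero power.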



C. Per-Stage Per-Subband Auction

Applying the per-stage subband auction in Section IV-C to the system dynamics setup in this section, we obtain a Scalarized Per-Subband Auction ($\forall n \in \{1, \cdots, N_F\}$) with low computational complexity and signaling overhead, as illustrated in Fig. 2. It is based on the per-user subband allocation Q-factor decomposition in Lemma 4 and the closed-form power allocation actions in Lemma 5, and proceeds as follows:

• Bidding: For the n-th subband, each user submits a bid



x.,„=^M5(Qi)liog(i + |^,,„P( 
• Subband Allocation: The BS assigns the n-th subband according to the highest bid:

$$s^*_{k,n}(\mathbf{H}_n, \mathbf{Q}) = \begin{cases} 1, & \text{if } k = k^*_n \\ 0, & \text{otherwise} \end{cases} \quad (34)$$

where $k^*_n = \arg\max_k X_{k,n}$ denotes the user with the highest bid, and then broadcasts the allocation results to the K users.

Power Allocation: Each user determines the transmit 
power according to: 



Pinion, Q) 
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(35) 



Remark 7 (Comparison to Brute-Force (CSI,QSI)-Feedback): In the brute-force (CSI,QSI)-feedback scheme, each MS $k$ needs to feed back the CSI $|H_{k,n}|$ ($\forall n$), the QSI $Q_k$ and the LMs $\gamma^k$. The BS needs to solve for the subband allocation $s^*_{k,n}$ and the power allocation $p^*_{k,n}$, and broadcast the (real number) power allocation $p^*_{k,n}$ to the MSs. Note that for the signaling from MS to BS, the numbers of quantization bits used for the bid $X_{k,n}$ and for the CSI $|H_{k,n}|$ are similar. However, the proposed per-subband auction does not need to feed back the QSI and LM. For the signaling from BS to MS, the proposed per-stage auction only needs 1 bit per subband for $s^*_{k,n}$. However, the brute-force (CSI,QSI)-feedback scheme needs many more bits per subband for a relatively accurate $p^*_{k,n}$ to ensure acceptable performance. Therefore, compared with the brute-force (CSI,QSI)-feedback scheme for uplink OFDMA systems, the proposed scalarized per-subband auction greatly reduces the signaling overhead and computation complexity (at the BS) for subband allocation and power allocation in the decentralized solution. ■
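The bidding, subband allocation and power allocation steps of the auction can be sketched as follows. The bid used here, the net benefit $b\log(1 + |H|^2 p) - \gamma p$ evaluated at the water-filling power $p = (b/\gamma - 1/|H|^2)^+$ with a QSI-dependent weight $b$, is an assumption made for illustration rather than the exact bid formula; all names and numbers are hypothetical.

```python
import math


def bid(gain, benefit, gamma):
    """Assumed scalar bid: net benefit of a subband at the water-filling power.

    gain    : |H_{k,n}|^2
    benefit : QSI-dependent weight of the rate term (hypothetical)
    gamma   : Lagrange multiplier pricing the transmit power
    """
    power = max(0.0, benefit / gamma - 1.0 / gain)
    return benefit * math.log(1.0 + gain * power) - gamma * power, power


def per_subband_auction(gains, benefits, gammas):
    """gains[k][n] -> (winning user per subband, winner's power per subband)."""
    num_users, num_subbands = len(gains), len(gains[0])
    winners, powers = [], []
    for n in range(num_subbands):
        # Each user computes its bid locally from its own CSI, QSI and LM.
        bids = [bid(gains[k][n], benefits[k], gammas[k]) for k in range(num_users)]
        # The BS only compares scalar bids and announces the winner.
        k_star = max(range(num_users), key=lambda k: bids[k][0])
        winners.append(k_star)
        powers.append(bids[k_star][1])  # power is determined by the winning user
    return winners, powers


winners, powers = per_subband_auction(
    gains=[[2.0, 0.2], [1.0, 1.0]], benefits=[1.0, 3.0], gammas=[1.0, 1.0])
```

In this toy example the second user wins both subbands even though the first user has the better channel on the first subband, because its larger QSI-dependent weight produces the higher bid. The BS never needs the QSI or the LM of either user.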

D. Online Per-user Primal-Dual Learning Algorithm via 
Stochastic Approximation 

In this part, we shall apply the online localized primal-dual learning algorithm in Section IV-D to estimate $\{q^k(Q, |H|, s)\}$ and the LMs. The update equations for the LMs are the same as (20) and (21), and hence, we shall focus on the online learning of the per-user per-subband Q-factor $\{q^k(Q, |H|, s)\}$ in the following.

For notational convenience, we denote the per-user per-subband state-action pair as $\phi \triangleq (Q, |H|, s)$. Let $i$ ($1 \le i \le I_\phi$) be a dummy index enumerating over all the possible state-action pairs of each user over one subband with cardinality $I_\phi = 2N_H(N_Q + 1)$ and $\phi_{k,n}(t) \triangleq (Q_k(t), |H_{k,n}(t)|, s_{k,n}(t))$ be the current state-action pair observed at MS $k$ on subband $n$ at the $t$-th slot. Based on the current observation $\phi_{k,n}(t)$, user $k$ updates its estimate on the per-user per-subband Q-factor according to:

') =qUf)^^l^^^,t)[9k,n^(^l'l''^Pk,n^it)) 

-^wnQk{i^l))-qH^'))-Qtm] 

■ l[Un {(l^kAt) = f}] (36) 



Qt^ii 



where lk{(t)\t) = EL^o.^t^n {^/c,nM = </>'}] is the 
number of updates of q^{(j)^) till t |[T6l , n^ G {n : (j)k,n{t) = 
(/)*]Ej, f — sup{t : (t)k,n{t) = 0^}, ^^ is the reference (per- 
subband) state-action combinatiorEj (per-user per-subband), 



E. Rate of Convergence and Asymptotic Performance 

In this section, we shall discuss the convergence speed as well as the asymptotic performance of the proposed distributive stochastic learning algorithm. In particular, we are interested in how the convergence speed scales with the number of MSs $K$ and the number of subbands $N$. In the asynchronous per-user per-subband Q-factor learning algorithm, at slot $t$, each user $k$ updates the Q-factor of all the per-user per-subband state-action pairs observed in the $N$ subbands. Thus, the convergence speed of the asynchronous per-user per-subband Q-factor learning algorithm depends on the speed at which every per-user per-subband state-action pair of each user $k$ is visited at the steady state. We define the ergodic visiting speed for each MS $k$ as $V_k \triangleq \lim_{t\to\infty} \min_i \frac{l_k(\phi^i, t)}{t}$, where $l_k(\phi^i, t) \triangleq \sum_{m=0}^{t} \mathbf{1}[\cup_n \{\phi_{k,n}(m) = \phi^i\}]$ is the number of updates of $q^k(\phi^i)$ up to slot $t$. The following lemma summarizes the main results regarding the ergodic visiting speed.

Lemma 6 (Ergodic Visiting Speed w.r.t. K and N): The ergodic visiting speed for each MS $k$ of the per-user per-subband Q-factor stochastic learning algorithm in (36) is given by $V_k = O(N/K)$ ($\forall k$). ■

Proof: Please refer to Appendix F. ■

Remark 8 (Interpretations): Note that the convergence rate of the learning algorithm is related to $V_k = O(N/K)$. Observe that the convergence speed increases as $N$ increases. This is because in the asynchronous update process in (36), each user $k$ updates the Q-factor of all the per-user per-subband state-action pairs observed in $N$ subbands in a single time slot. Hence, there is intrinsic parallelism in the learning process across different subbands.
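The parallelism across subbands can be made concrete with a minimal sketch of an asynchronous table update. This is a generic stochastic-approximation step and not the exact update in (36): the learning targets are passed in as plain numbers, the step size $1/l$ is assumed, and the class and method names are hypothetical. What it illustrates is that each distinct state-action pair observed in a slot is refreshed once, with its own visit counter.

```python
from collections import defaultdict


class PerSubbandQ:
    """Table of per-user per-subband Q-factors with per-entry visit counters."""

    def __init__(self):
        self.q = defaultdict(float)     # q(phi), initialized to 0
        self.visits = defaultdict(int)  # number of updates of q(phi) so far

    def update_slot(self, observations):
        """observations: one (phi, target) tuple per subband, seen in one slot.

        A pair observed on several subbands is updated only once per slot
        (the targets are assumed equal in that case).
        """
        seen = {}
        for phi, target in observations:
            seen.setdefault(phi, target)
        for phi, target in seen.items():
            self.visits[phi] += 1
            step = 1.0 / self.visits[phi]  # assumed step-size sequence
            self.q[phi] += step * (target - self.q[phi])
        return len(seen)


table = PerSubbandQ()
# One slot with N = 4 subbands: three distinct (Q, |H|, s) pairs are refreshed at once.
updated = table.update_slot([((2, "G", 1), 1.0), ((2, "B", 0), 0.5),
                             ((2, "G", 1), 1.0), ((2, "G", 0), 0.8)])
```

With more subbands, more distinct pairs are typically observed per slot, so every entry of the table is visited sooner.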

Finally, we shall show that the performance of the distributive algorithm is asymptotically globally optimal for a large number of users.

Theorem 2 (Asymptotically Global Optimal): For sufficiently large $K$ such that the optimization Problem 1 is feasible, the performance of the online distributive per-user primal-dual learning algorithm is asymptotically globally optimal, i.e. $\sum_{k=1}^{K} Q^k_\infty(\chi_k, \mathbf{s}_k) \to Q^*(\chi, \mathbf{s})$ and $\gamma_\infty \to \gamma^*$ as $K \to \infty$, where $Q^*(\chi, \mathbf{s})$ and $\gamma^*$ are the solution of the centralized Bellman equation in (13) satisfying the corresponding constraints in (9), (10). ■

Proof: Please refer to Appendix G. ■ 



i^Vn^^ e {n : (/>fc,nW = ^'}' 9k,n^ht ^(t^'^Pk,n^('^)) is equal. 

^^The reference (per-user) state-action combination ip^ is composed of the 
(per-subband) state-action combination (p^ . For example, say Np = 2, Q = 
{0, 1}, \H\ = {Good (G), Bad (B)}, s = {0, 1}, I^ = 2 x 2^^ x 2'^ = 48, 
I^ = 2 X 2 X 2 = 8. Let (f)^ = (0,B,0), then (/?^ = (0, {B,B}, {0, 0}) 
(aggregated over 2 subbands). Without loss of generality, we initialize the 
per-user per-subband Q-factor as 0, i.e. g§((/)^) = V/c. 




Fig. 3. Average delay per user versus SNR. The number of users $K = 2$, the buffer size $N_Q = 10$, the mean packet size $\bar{N}_k = 305.2$ Kbyte/pck, the average arrival rate $\lambda_k = 20$ pck/s, the queue weight $\beta_1 = \beta_2 = 1$. The packet drop rate of the proposed scheme is 5% while the packet drop rate of the Baseline 1 (M-LWDF), Baseline 2 (CSIT Only) and Baseline 3 (Round Robin) are 5%, 8%, 9% respectively.



VI. Simulation Results and Discussions 

In this section, we shall compare our proposed per-user online learning algorithm via stochastic approximation to the delay optimal problem for OFDMA uplink systems with the centralized subband allocation Q-factor $\{Q(\chi, \mathbf{s})\}$ learning algorithm and three other reference baselines. Baseline 1 refers to a throughput optimal policy, namely the Modified Largest Weighted Delay First (M-LWDF) [24], in which the subband and power control are chosen to maximize the weighted delay. Baseline 2 refers to the CSIT Only Scheduling, in which optimal subband and power allocation is performed purely based on CSIT. Baseline 3 refers to the Round Robin Scheduling, in which different users are served in TDMA fashion with equally allocated time slots and water-filling power allocation across the subbands. In the simulation, we consider Poisson packet arrival with average arrival rate $\lambda_k$ (pck/s) and exponential packet size distribution with mean $\bar{N}_k$. We consider average delay as our utility ($f(Q_k) = Q_k/\lambda_k$). We assume there are 64 subbands with total BW 10 MHz, and the number of independent subbands $N_F$ is 4. The scheduling slot duration $\tau$ is 5 ms. The buffer size $N_Q$ is 10.
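A few per-slot quantities implied by this setup can be computed directly. The sketch below uses the arrival rate of 20 pck/s quoted in the figure captions; reading the 4 independent subbands as 4 groups of 16 subbands each is an assumption, and the variable names are hypothetical.

```python
arrival_rate = 20.0       # lambda_k in packets per second (from the figure captions)
slot_duration = 5e-3      # tau in seconds
num_subbands = 64
num_independent = 4       # N_F
total_bandwidth_hz = 10e6

mean_arrivals_per_slot = arrival_rate * slot_duration     # packets per slot
subbands_per_group = num_subbands // num_independent      # assumed grouping
subband_bandwidth_hz = total_bandwidth_hz / num_subbands  # Hz per subband
```

The mean number of arrivals per slot is well below one packet, so a slot usually sees either no arrival or a single arrival for each user.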

Figure 3 illustrates the average delay per user versus SNR for 2 users. It can be observed that both the centralized solution and the distributive solution have significant gain compared with the three baselines (e.g. more than 7.5 dB gain over M-LWDF when the average delay per queue is less than 9 packets). In addition, the delay performance of the distributive solution, which is asymptotically globally optimal for a large number of users, is very close to the performance of the optimal solution even for $K = 2$. Similar observations could be made in Figure 4, where we plot the average weighted delay versus SNR of two heterogeneous users.

Figure 5 illustrates the average delay per user of the distributive solution versus the number of users at a transmit

^^ Throughput optimal policy means that it shall stabilize the queue when- 
ever the arrival rate vector falls within the stability region. 
Fig. 4. Average weighted delay versus SNR. The number of users $K = 2$, the buffer size $N_Q = 10$, the mean packet size $\bar{N}_k = 305.2$ Kbyte/pck, the average arrival rate $\lambda_k = 20$ pck/s, the queue weight $\beta_1 = 1$, $\beta_2 = 4$. The packet drop rate of the proposed scheme is 7% while the packet drop rate of the Baseline 1 (M-LWDF), Baseline 2 (CSIT Only) and Baseline 3 (Round Robin) are 7%, 9%, 9% respectively.

Fig. 6. Cumulative distribution function (cdf) of the queue length. The buffer size $N_Q = 10$, the mean packet size $\bar{N}_k = 78.125$ Kbyte/pck, the average arrival rate $\lambda_k = 20$ pck/s, the queue weight $\beta_k = 1$, the number of users $K = 6$ at a transmit SNR = 10 dB. The packet drop rate of the proposed scheme is 2% while the packet drop rate of the Baseline 1 (M-LWDF), Baseline 2 (CSIT Only) and Baseline 3 (Round Robin) are 2%, 8%, 8% respectively.

[Figure annotations, average delay: Distributive: 5.8769 (a second annotation reads 5.3874); M-LWDF: 9.143; CSIT Only: 8.358; Round Robin: 9.152.]

Fig. 5. Average delay per user versus the number of users. The buffer size $N_Q = 10$, the mean packet size $\bar{N}_k = 78.125$ Kbyte/pck, the average arrival rate $\lambda_k = 20$ pck/s, the queue weight $\beta_k = 1$ at a transmit SNR = 10 dB. The packet drop rate of the proposed scheme is 4% while the packet drop rate of the Baseline 1 (M-LWDF), Baseline 2 (CSIT Only) and Baseline 3 (Round Robin) are 4%, 8%, 9% respectively.



SNR = 10 dB. It can be seen that the distributive solution has significant gain in delay over the three baselines. Figure 6 further illustrates the cumulative distribution function (cdf) of the queue length for $K = 6$ and SNR = 10 dB. It can be seen that the distributive solution achieves a smaller queue length compared with the other baselines.

Figure 7 illustrates the convergence property of the proposed algorithm. We plot the average $\{W^k(Q_k)\}$ of 10 users versus the scheduling slot index at a transmit SNR = 10 dB. It can be seen that the distributive algorithm converges quite fast. The average delay corresponding to the average $\{W^k(Q_k)\}$ at the 500-th scheduling slot is 5.9 pck, which is much smaller than
^^In conventional iterative algorithms for deterministic NUM, there is 
message passing between iterative steps within a CSI realization and these 
iterative steps (before convergence) are overheads because they do not carry 
useful payload. On the other hand, the proposed algorithm is an online 
distributive algorithm and hence, the slots before "convergence" also carry 
useful payload and they are not "wasted". 




Fig. 7. Illustration of convergence property. The average $\{W^k(Q_k)\}$ of 10 users versus the scheduling slot index. The number of users $K = 10$, the buffer size $N_Q = 10$, the mean packet size $\bar{N}_k = 78.125$ Kbyte/pck, the average arrival rate $\lambda_k = 20$ pck/s, the queue weight $\beta_k = 1$ at a transmit SNR = 10 dB. The packet drop rate of the proposed scheme is 4% while the packet drop rate of the Baseline 1 (M-LWDF), Baseline 2 (CSIT Only) and Baseline 3 (Round Robin) are 4%, 8%, 9% respectively.



the other baselines. 

VII. Summary 

In this paper, we consider a distributive delay-optimal power and subband allocation design for an uplink OFDMA system, which is cast into an infinite-horizon average-reward CMDP. To address the distributive requirement and the issue of exponential memory requirement and computational complexity, we propose a per-user online learning algorithm with a per-stage auction, which requires local QSI and local CSI only. We show that under the auction, the distributive online learning converges with probability 1. For illustration, we apply the proposed learning algorithm to an application example with exponential packet size distribution. We show that the delay-optimal power control has the multi-level water-filling structure. We also show that the proposed algorithm converges to the globally optimal solution for a sufficiently large number of users. Numerical results illustrate significant delay performance gain over various baselines.



Appendix 

Appendix A: Proof of Lemma 1

For a given $\gamma$, the optimizing policy for the unconstrained MDP in (12) can be obtained by solving the Bellman equation
w.r.t. (<9,{F(x)}) as below El: 



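$$\theta + V(\chi^i) = \min_{\Omega(\chi^i)}\Big[g(\gamma, \chi^i, \Omega(\chi^i)) + \sum_{\chi^j}\Pr[\chi^j \,|\, \chi^i, \Omega(\chi^i)]\,V(\chi^j)\Big], \quad \forall 1 \le i \le I_\chi \quad (37)$$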



E[SM^{cp')SM^,{cp')] = (Vt ^ f). For some j, define 
Mf'i^p') = ELj ^l^^ti^')- Then, from (|38]), we have 

t 
=QW) + E ^?^f (7'' ^') + ^'(^0 (39) 

Since Et[M/^((/:)^)] = M^_^{ip'), M^{(p') is a Mar- 
tingale sequence. By martingale inequality, we have 
Pr,-{sup,<,<JM,'=(^0| > A} < MMl^fnn, By the 
property of martingale difference noise and the condi- 
tion on the stepsize sequence, we have Ej[\M^{(p'^)\'^] 






MY^UAe^)^ < 00, where M = m8iXj<i<t {SMj'iip'))^ < 



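where $\Omega(\chi^i) = (\mathbf{p}, \mathbf{s})$ is the power control and subband allocation actions taken in state $\chi^i$, $\theta = L^*_\beta(\gamma) = \inf_\Omega L_\beta(\Omega, \gamma)$ is the optimal average reward per stage, and $\{V(\chi)\}$ is the potential function of the MDP. Since $\Omega(\chi^i) = (\Omega_s(\chi^i), \Omega_p(\chi^i))$, we define the subband allocation Q-factor of state $\chi^i$ under subband allocation action $\mathbf{s}$ as $Q(\chi^i, \mathbf{s}) \triangleq \min_{\Omega_p(\chi^i)}\big[g(\gamma, \chi^i, \mathbf{s}, \Omega_p(\chi^i)) + \sum_{\chi^j}\Pr[\chi^j \,|\, \chi^i, \mathbf{s}, \Omega_p(\chi^i)]\,V(\chi^j)\big] - \theta$. Thus, $V(\chi) = \min_{\mathbf{s}} Q(\chi, \mathbf{s})$ ($\forall \chi$) and $\{Q(\chi, \mathbf{s})\}$ satisfy the Bellman equation in (13).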

Appendix B: Proof of Lemma 2

Since, $\forall k$, each state-action pair $\varphi^i$ is updated comparably often [16], the only difference between the synchronous update and the asynchronous update is that the resultant ODE of the asynchronous update is a time-scaled version of that of the synchronous update [16]. However, this does not affect the convergence behavior. Therefore, we consider the convergence of the related synchronous version for simplicity in the following.

Due to symmetry, we only consider the update for user $k$. It can be easily proved that the synchronous version of the per-user Q-factor update in (19) is equivalent to the per-user Q-factor update given by



Q?+i(^^) 



Q?((^^) + e?r,^(7^(/:^^) 1 < i < /^ (38) 



where Y,\-i\^^) = g^h^ ^\ p' (t)) + W,^{Qk{t + 1)) - 

(^i^(7^^^p'(^")) + wnQi) - QH^n) - qu^')- ^^- 

note Y,^ ^ (y,^(7^(/^^),•••,>^t'(7^^'-))^. Let Q^ ^ 
{Ql-'- . Qf ) and Y, 4 (Yi, • • • , Yf ) be the aggregate 
vector of per-user Q-factor and Y^ (aggregate across all K 
users in the system). We shall first establish the convergence 
of the martingale noise in the Q-factor update dynamics. 
Let Et and Pr^ denote the expectation and probability con- 
ditioned on the cr-algebra Tt^ generated by {Qq,Y^,z < 
t}, i.e. Et['] = E[-\Tt] and Prt[-] = Pr[-|J't]. Define 
RHl^^') = E,K^(7^(^^)] = T^{l\Q^t) - QU^') - 
(T,^(7^Q,') - QH^n)^ and SM,^{^^) ^ Y,^h\^') - 
EtK^(7^,(/:^')]. Thus, 8M^{if'') is the martingale difference 
noise satisfying the property that Et[5M^{(^^)] = and 



00. Hence, we have Hirj^oo^^j {^^Pj<i<t\Mti^^)\ — 
A} -^ 0. Thus, from ^, we have Q^^i{(p^) = Q^{^') + 
Yi^j ^?^f (7^5 ^*) a.s. with the vector form 



Qf+i 



Qy 



1=3 



(40) 



where Rf = T'=(7^ Qf) - Qf - (T/=(7^ Sf) - Qf ((p'"))e 
and e = [1, • • • , 1]^ is the /^^ x 1 unit vector. 

Next, we shall establish the convergence of the dynamic 
equation in (|4Ql) after the martingale noise are averaged out. 
Let g^ and P^ denote the reward column vector and the 
transition probability matrix under the power allocation p^, 
which attains the minimum of T^ of the t-th iteration. Denote 
z^ = T^^(7^, Q^t) - Qti^'')- Then, we have 



Rti=gti+^'-iQti-Qti- 



H-l' 



<g,^ + p,'Qti-s; 



k 
t-1 



H-l"" 



by iterating 



< BtiRti - (z^ - zt,)e, Vfc > 1 



< Bj_i •••Bj_^qj_^ 



Since RUi', vl = THi', Qt) " Qti^n " {THi', Qt) " 
Qfc((p'-)) = Vt, by (El, we have 

(1 - 5^) min i?t„»(7', /) " {z^ " zlm) < Rtil\ f') 
< (1 - S^) maxi^t^(7^ /) - {z^ - zlJVi 



) 



I 

'mini' i^t^(7^ ¥>*') > (1 - 5m) mini, •R^„^(7^ ¥>*' 
maxi' RUi'', </?'')< (1 - <^m) maxi- R^_^{-f'', ip^') 

-{Zt - Zt-m) 
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=> max R^ (7^ , V^'' ) - min R^ (7^ , cp^' ) 

i' i' 

< (1 - ,5™)(maxi?t„»(7',/) - mini?t„»(7',/)) 

^maxi?f(7^/) - mini^,^(7^/) < 0^ TT (1 - 5j+im) 

where ^j > 0. Since R^{'^^^(p^) = Vt, we have 

maxi/ i?^ (7^ ,(/:?*') > and mini/ R^ (7^ , V^^' ) < 0. 

Thus, Mi, we have | i?^ (7^, (/:?*) | < max^/ i?^(7^, (/:?* ) — 

•/ I ^~j I 

min^/ i?^(7^,(/:)' ) < (/>j nz=T (^ ~ ^j+/m). Therefore, as 

ilar to the 

1 in Chapt 

up an additive constant. Since Qj^((^^) = Qo(^^) ^^' we have 

the convergence of the per-user subband allocation Q-factor 

limz^oo Q^t = 2^(7) almost surely. 



where W^{Qk 



nW\Xk)\Qk] and /^W\Qk) 



nW^{Qk + Ak) - W^{Qk + Afc - l)|Qfc]. Then, we have 
Q^iXk^^k) = T^n^^iQk^ \Hk,n\^Sk,n)- Thus, wc Can derive 

^'(%.) =E[Q^(x„Kn = l[\Hk,n\ > H^K-l]})\Xk] 



E 



^[q {Qk-, \Hk,n\-,Sk,n = l[|^/c,n| ^ ^K-lDlQ/c? ^/c,n] 



t ^ 00, R^ ^ 0, i.e. 2^(7) satisfies equation in (|25]) . Sim- 






potential function of Bellman equation (Proposition =^iE[^fe(^g^^ |i:^fe,n|)|Q/c] = NFW^{Qk) 
ter 7 of (O^l), the solution to (|25]) is unique only n ^ ^ ^ 



i)IQa 



Appendix C: Proof of Lemma 3

Due to the separation of timescales, the primal update of the Q-factor can be regarded as converged to $\mathbf{Q}^k_\infty(\gamma_t)$ w.r.t. the current LMs $\gamma_t$ [19]. Using the standard stochastic approximation theorem [18], the dynamics of the LM update equations in (20) and (21) can be represented by the following ODE:

$$\dot{\gamma}(t) = \mathbb{E}^{\Omega^*(\gamma(t))}\Big[\Big(\sum_n p_{1,n} - P_1\Big), \big(\mathbf{1}[Q_1 = N_Q] - P^d_1\big), \cdots, \Big(\sum_n p_{K,n} - P_K\Big), \big(\mathbf{1}[Q_K = N_Q] - P^d_K\big)\Big] \quad (41)$$



Hi.Qi 



where n*{f{t)) = {n;{-r{t)),n:{f{t))) is the 
converged control policies in (fTSl) and (fTTl) w.r.t. 
the current LM 7(t), and E^*^^'^^^^^] denotes the 
expectation w.r.t. the measure induced by ^*(7). Define = min [0^(7^ xi ^k Pk) 

G{-f) = E^*(^)[Efc^fe(7^%l,Sfe,pfe)]. Since subband ^^ ^ 

allocation policy is discrete, we have ^^(7) = 1^J(7 + S^). 



=Nf E[w^{Qk + Ak) - w\Qk ^Au- l)\Qk] 

Therefore, from (l42l) , we can obtain ([29l) . 

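Appendix E: Proof of Lemma 5

The conditional transition probability of user $k$ is given by $\Pr[\chi^j_k \,|\, \chi^i_k, \mathbf{s}_k, \mathbf{p}_k] = \Pr[\mathbf{H}^j_k]\,\Pr[Q^j_k \,|\, \chi^i_k, \mathbf{s}_k, \mathbf{p}_k]$, where $\Pr[Q^j_k \,|\, \chi^i_k, \mathbf{s}_k, \mathbf{p}_k] = \Pr[A_k = Q^j_k - Q^i_k + 1]\,\mu_k(\chi^i_k, \mathbf{s}_k, \mathbf{p}_k)\tau + \Pr[A_k = Q^j_k - Q^i_k]\,(1 - \mu_k(\chi^i_k, \mathbf{s}_k, \mathbf{p}_k)\tau)$.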

QHxl^k) 

nin U/c(7^,xLs/e,p/e) 
Pfc L 

+ ^ Pr[Hi]Pr[gi|xLs,,p,]I^^(xi)' 



(a) 



aG 



Hence, by chain rule, we have ^ 



E 



(n;(7),^:(7)) 



^-y's 



\p dG ^Pk,n, 

2^k,n dpi ^ dj'^ 



lEnPln - Pk]- Since 1^;(7) 



argmm^^(^) 



we 



7^k 



Tit)- 



Similarly, |^ = e(^;(^)>^*(^)) [l[Qk = Nq]-P^] = 7'=(i). 
Therefore, we show that the ODE in (1411) can be expressed as <^ mm 7 2_^ pk 
7(t) = vG(7(t)). As a result, the ODE in dlB will converge ^'^ n 

to VG(7) =0, which corresponds to Q and (fTOl) . 

Appendix D: Proof of Lemma[4] 



-^Y.^r[Qi\x'k.Sk.Pk]W\Qi) 
Ql 
min U/c(7^,%Ls/c,P/c) 

Pfc L 

+ {l-^ik{xl,Sk,Pk)T)E[W''{Ql 
+ iikixi, Sfe, Pk)TE[W''{Qi + Ak 



-Ak)\Qk 
i)\Qk 



= (2^Sfc,„l0g(l 



Pk,, 



\Hk 



(43) 



')) 



where (a) is due to ([T6l) and the above per-user transition prob- 
ability, (b) is due to the definition WH^Qk) = HW^J^j,)\Qk] 
and (d)Js due to the definition AW^iQk) = ^W^iQk + 
Ak) - W^{Qk -^ Ak - l)\Qk]- By applying standard con- 
Let q^{Qk,\Hk,nlsk,n) = ^in{gk,nh\QkAHk,nlsk,n.Pk,n)''^^ optimization techniques and Lemma H (AI^^(Qfc) = 

^'^;^ NFSw^{Qk)), the optimal solution to (|43]) is given by (1331) . 

AW^{Qi)r 2x 

__ Sk,niOg[L^Pk,n\^k,n\ ) APPENDIX El PROOF OF LEMM A [6] 



nW^{Ql^Ak)\Qk] 



Nf 



(42)

We first fix $K$ and consider the growth of the ergodic visiting speed w.r.t. $N$. As $N$ increases, the number of per-user per-subband state-action pair observations made at each time slot increases (this "parallelism" helps to speed up the convergence rate). Thus, the chance that all per-user per-subband state-action pairs of each user are visited grows like $O(N)$, and hence, the ergodic visiting speed of each user grows like $O(N)$. Next, we fix $N$ and consider the growth of the ergodic visiting speed w.r.t. $K$. Each subband can only be allocated to one user. Thus, the chance of the bottleneck state-action pair with $s = 1$ for each user being visited decreases like $O(1/K)$, and hence, the ergodic visiting speed of each user scales like $O(1/K)$. Combining the above two cases, we conclude Lemma 6.

Appendix G: Proof of Theorem 2

For given $\gamma$, we shall prove that under a Best-CSI subband allocation policy, the Q-factor satisfying the Bellman equation (13) can be decomposed into the additive form in (15). Based on that, we shall show that for large $K$, the linear Q-factor approximation in (15) is indeed optimal.

Definition 2 (Best-CSI Subband Allocation Policy): A Best-CSI subband allocation policy is defined as $\Omega_s(\mathbf{H}) = \{s_{k,n}(\mathbf{H}_n) \in \{0, 1\} \mid \sum_{k=1}^{K} s_{k,n} = 1\ \forall n\}$, where

$$s_{k,n}(\mathbf{H}_n) = \mathbf{1}\big[|H_{k,n}| \ge \max_{j \ne k} |H_{j,n}|\big] \quad (44)$$

We first establish a property of the Q-factor in the original Bellman equation in (13) under the Best-CSI subband allocation policy, which is summarized in Lemma 7.

Lemma 7 (Additive Property of the Subband Allocation Q-Factor): Under the Best-CSI subband allocation policy, the solution to the original Bellman equation in (13) can be expressed in the form $Q(\chi, \mathbf{s}) = \sum_k Q^k(\chi_k, \mathbf{s}_k)$, where $\{Q^k(\chi_k, \mathbf{s}_k)\}$ is the converged per-user Q-factor, which is also the solution of the $k$-th user's per-user subband allocation Q-factor fixed point equation given by (16).

Proof: Under the Best-CSI subband allocation policy, the Bellman equation in (13) becomes



Q(x\s) = min 



g{l,x\^.%{x')) 






no 



VI < i < /v, Vs 



y(Q^- 



X' 



(45) 



^V(Q^) = EPr[Hi min 






g{-f,x\^sm.^p{x')) 

9, l<i<lQ 



(46) 



where (a) is due to (|7]) and the definition V{Q) = 
E[Q(x,(^5(H))|Q], (b) is obtained by taking conditional 
expectation (conditioned on Q*) on both sides of J45]) and 
the definition o£ y(Q). In addition, denote AkV{Q) = 
E[V{Q + A) - F(Q + A - ek)\Q]. 



J'rom (|45]) , we know that {Q(x*,s)}_is determined by 
{V{Q')}. Next, we shall try to solve {V{Q')} by the Iq 
equations in (l46l) . First, assume the linear approximation 
Q(X,^.(H)) = EkQ^iXk.^sC^)) holds under the best- 
CSI subband allocation policy, we have 

nQ)=E[^Q'=(Xfc,f^J(H))|Q] 

k 
k 
k 

= VEfE[Q'=(Qfe,Hfe,{sfe,„ = l[|fffc,„| > max|iJ,-„|]}) 

, L j^k 

k 



AkViQ) =E[^ WHQ, + Aj) - ( ^ WHQj + Aj) 

3 j^k 

+ W^iQk ^Ak- 1)) IQ] = AW^iQk) 

Thus, the optimal power allocation and corresponding condi- 
tional departure rate to min^^^^^i) [•] part in (l46l) are as follows 



P/c,n(Q/e, |^/e,n|,5/e,n(Hn)), V/c, n 



=5fe 



.^AWHQk) 1 ,+ 

n(tln)y — 



7'= \Hk,n\^ 

/Xfe((5fc,Hfc,Sfe(H)), Vfc 



(47) 



(48) 



N> 



= ^ Sfe_„(H„) log(l + Pk,n{Qk, |i?fc,n|, Sfc,n(H„))|iJfc,„P) 



Therefore, from ( l46l ). we have 

E^'(^fc) = E {9kh\ Qi) + m'iQi + Ak)\Qi] 

k k 

-MQi)rAW\Ql))-8 
^e = Y,0'' = Y. {9k{l\ Ql) + m\Ql + Ak)\Qi] 

k k 

-MQl)r^W\Qi) -W\Ql)), 



l<i<lQ (49) 



where 5fc(7^Ql) = ^likf{Qk) + 

t{T.nPkAQkAHk,nl^kA^n)) " Pk) + f {^1 = 

Nq] - Pk)\Qk\ and MQk) = E[/Xfe(Qfe,Hfc,Sfe(H))|Qfe]. 
Since there are only {Nq + 1) QSI states for each user and the 
structure in ( |49] ) is decoupled under the additive assumption, 
for each user k, there are only {Nq + 1) independent Poisson 
equations with Nq + 2 unknowns {0'',W''{Qk)}. Ok is 
unique and {W''{Qk)} is unique up to an additive constant 
[13]. Therefore, J6',V^(Q)} is^the solution to dSl, where 
e = ^2,9" mdV{Q) = j:,wHQk). 
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Next, we shall^how Q(x,s)^ ^^ Q^iXk^^k)- Substitute 
^ = T.k^^ and V{Q) = Efc W^iQk) into gS]), we have 

Q(x^s) = min [^(7, %', s, ^p(xO) 

+ E p^[Q' l^^ ^' ^^(^^)] ( E ^'(Qi))] - E ^' 

Qi /e ^ I 



= E2'(%Lsfc) 



where Q^ixl^^k) = 



^Pfc 



^fe(7^,%*fc,Sfe,Pfe) 



0^, which is equivalent [15] 



EQiPAQi\xl,sk,Pk]wHQi)\ 

to ([T6l) . By Lemma O the converged {Q^(%fe7Sfe)} satisfy 
([T6l) and this completes the proof. ■ 

Next, we shall consider the asymptotic subband allocation 
results for large K. The optimal control actions to ([T3l) are 
given by 



Pfc,n(Hn,Q) =5fe,n(Hn,Q)f — ^^ 



7 



\m 



S/c,n(Hn, Q) 



(i: 



if Xk,n = maxj- {Xj>} > 



otherwise 



(50) 
(51) 



where F*(Q) 
E[y*(Q + A) 



4^E[minsQ*(x,s)|Q], Afcy*(Q) : 
- l^*(Q + A - efe)|Q] and Xfc,„ : 

^A,y*(Q)log (l + |i/,,„|2(i^^ _ ^^)+) 

. Denote /c* = argmax/e |i^/c,n| 



N 



^^(^^ 



Afcy*(Q) 



For large $K$, $|H_{k^*_n,n}|^2$ grows with $\log(K)$ by extreme value theory. Because the traffic loading remains unchanged as we scale up $K$, $\max_{k,j} |\Delta_k V^*(\mathbf{Q}) - \Delta_j V^*(\mathbf{Q})| = O(1)$. Hence, $X_{k^*_n,n}$ grows like $\log(\log(K))$. As $K \to \infty$, $\Pr[k^*_n = \arg\max_k X_{k,n}] \to 1$. Thus the subband allocation result of the optimal subband allocation in (51) and the Best-CSI subband allocation in (44) will be the same for large $K$. Using the result in Lemma 7, the linear Q-factor approximation is therefore asymptotically accurate for given $\gamma$. Combining the results of Theorem 1, we can prove Theorem 2.
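The $\log(K)$ growth used in the argument above can be checked numerically under one additional assumption: Rayleigh fading with unit-mean $|H_{k,n}|^2$, so that the channel gains are i.i.d. exponential (the fading distribution is not specified in this appendix, and the function name is hypothetical). In that case the expected best gain among $K$ users equals the harmonic number $\sum_{j=1}^{K} 1/j$, which behaves like $\ln K + 0.5772$.

```python
import math


def expected_best_gain(num_users):
    """E[max_k |H_k|^2] for i.i.d. unit-mean exponential gains (harmonic number)."""
    return sum(1.0 / j for j in range(1, num_users + 1))


for K in (2, 10, 100, 1000):
    # Exact expectation next to the logarithmic approximation ln(K) + 0.5772.
    print(K, round(expected_best_gain(K), 3), round(math.log(K) + 0.5772, 3))
```

The two columns approach each other as $K$ grows, which is the sense in which the best user's channel gain scales logarithmically with the number of users.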
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